The goal of this document is to describe the components of the Cilium architecture, and the different models for deploying Cilium within your datacenter or cloud environment. It focuses on the higher-level understanding required to run a full Cilium deployment.
A deployment of Cilium consists of the following components running on each Linux container node in the container cluster:
- Cilium Agent (Daemon): Userspace daemon that interacts with the container runtime and orchestration systems such as Kubernetes via Plugins to set up networking and security for containers running on the local server. Provides an API for configuring network security policies, extracting network visibility data, etc.
- Cilium CLI Client: Simple CLI client for communicating with the local Cilium Agent, for example, to configure network security or visibility policies.
- Linux Kernel BPF: Integrated capability of the Linux kernel to accept compiled bytecode that is run at various hook / trace points within the kernel. Cilium compiles BPF programs and has the kernel run them at key points in the network stack to have visibility and control over all network traffic in / out of all containers.
- Container Platform Network Plugin: Each container platform (e.g., Docker, Kubernetes) has its own plugin model for how external networking platforms integrate. In the case of Docker, each Linux node runs a process (cilium-docker) that handles each Docker libnetwork call and passes data / requests on to the main Cilium Agent.
In addition to these components, Cilium also depends on the following components running in the cluster:
- Key-Value Store: Cilium shares data between Cilium Agents on different nodes via a kvstore. The currently supported key-value stores are etcd and consul.
- Cilium Operator: Daemon for handling cluster management duties which can be handled once per cluster, rather than once per node.
The Cilium agent (cilium-agent) runs on each Linux container host. At a high level, the agent accepts configuration that describes service-level network security and visibility policies. It then listens to events in the container runtime to learn when containers are started or stopped, and it creates custom BPF programs which the Linux kernel uses to control all network access in / out of those containers. In more detail, the agent:
- Exposes APIs to allow operations / security teams to configure security policies (see below) that control all communication between containers in the cluster. These APIs also expose monitoring capabilities to gain additional visibility into network forwarding and filtering behavior.
- Gathers metadata about each new container that is created. In particular, it queries identity metadata like container / pod labels, which are used to identify endpoints in Cilium security policies.
- Interacts with the container platform’s network plugin to perform IP address management (IPAM), which controls what IPv4 and IPv6 addresses are assigned to each container. IPAM is managed by the agent in a pool shared between all plugins, which means that the Docker and CNI network plugins can run side by side, allocating from a single address pool.
- Combines its knowledge about container identity and addresses with the already configured security and visibility policies to generate highly efficient BPF programs that are tailored to the network forwarding and security behavior appropriate for each container.
- Compiles the BPF programs to bytecode using clang/LLVM and passes them to the Linux kernel to run for all packets in / out of the container’s virtual ethernet device(s).
Cilium CLI Client
The Cilium CLI Client (cilium) is a command-line tool that is installed along with the Cilium Agent. It gives a command-line interface to interact with all aspects of the Cilium Agent API. This includes inspecting Cilium’s state about each network endpoint (i.e., container), configuring and viewing security policies, and configuring network monitoring behavior.
Linux Kernel BPF
Berkeley Packet Filter (BPF) is a Linux kernel bytecode interpreter originally introduced to filter network packets, e.g. for tcpdump and socket filters. It has since been extended with additional data structures such as hash tables and arrays as well as additional actions to support packet mangling, forwarding, encapsulation, etc. An in-kernel verifier ensures that BPF programs are safe to run and a JIT compiler converts the bytecode to CPU architecture specific instructions for native execution efficiency. BPF programs can be run at various hooking points in the kernel such as for incoming packets, outgoing packets, system calls, kprobes, etc.
BPF continues to evolve and gain additional capabilities with each new Linux release. Cilium leverages BPF to perform core datapath filtering, mangling, monitoring and redirection, and requires BPF capabilities that are in any Linux kernel version 4.8.0 or newer. On the basis that 4.8.x is already declared end of life and 4.9.x has been nominated as a stable release, we recommend running at least kernel 4.9.17 (the latest current stable Linux kernel as of this writing is 4.10.x).
Cilium is capable of probing the Linux kernel for available features and will automatically make use of more recent features as they are detected.
Linux distros that focus on being a container runtime (e.g., CoreOS, Fedora Atomic) typically already ship kernels that are newer than 4.8, but even recent versions of general purpose operating systems such as Ubuntu 16.10 ship fairly recent kernels. Some Linux distributions still ship older kernels but many of them allow installing recent kernels from separate kernel package repositories.
For more detail on kernel versions, see: Linux Kernel.
The Key-Value (KV) Store is used for the following state:
- Policy Identities: list of labels <=> policy identity identifier
- Global Services: global service id to VIP association (optional)
- Encapsulation VTEP mapping (optional)
To simplify things in a larger deployment, the key-value store can be the same one used by the container orchestrator (e.g., Kubernetes using etcd).
The Cilium Operator is responsible for managing duties in the cluster which should logically be handled once for the entire cluster, rather than once for each node in the cluster. Its design helps with scale limitations in large Kubernetes clusters (>1000 nodes). The responsibilities of the Cilium operator include:
- Synchronizing Kubernetes services with etcd for Cluster Mesh
- Synchronizing node resources with etcd
- Ensuring that DNS pods are managed by Cilium
- Garbage-collection of Cilium Endpoints resources
If Cilium loses connectivity with the KV-Store, it guarantees that:
- Normal networking operations will continue;
- If policy enforcement is enabled, existing endpoints will still have their policy enforced, but you will lose the ability to add additional containers that belong to security identities which are unknown on the node;
- If services are enabled, you will lose the ability to add additional services / loadbalancers;
- When the connectivity is restored to the KV-Store, Cilium can take up to 5 minutes to re-sync the out-of-sync state with the KV-Store.
Cilium will keep running even if it is out-of-sync with the KV-Store.
If Cilium crashes or the DaemonSet is accidentally deleted, the following is guaranteed:
- When running Cilium as a DaemonSet / container with the specification files provided in the documentation (see Installation with external etcd), the endpoints / containers which are already running will not lose any connectivity, and they will keep running with the policy loaded before Cilium stopped unexpectedly.
- When running Cilium in a different way, make sure the BPF filesystem is mounted (see Mounted BPF filesystem).
Labels are a generic, flexible and highly scalable way of addressing a large set of resources as they allow for arbitrary grouping and creation of sets. Whenever something needs to be described, addressed or selected, it is done based on labels:
- Endpoints are assigned labels as derived from the container runtime, orchestration system, or other sources.
- Network policies select pairs of endpoints which are allowed to communicate based on labels. The policies themselves are identified by labels as well.
What is a Label?
A label is a pair of strings consisting of a key and a value. A label can be formatted as a single string with the format key=value. The key portion is mandatory and must be unique. This is typically achieved by using the reverse domain name notation, e.g. io.cilium.mykey=myvalue. The value portion is optional and can be omitted, e.g. io.cilium.mykey.
Key names should typically consist of the character set
When using labels to select resources, both the key and the value must match, e.g. when a policy should be applied to all endpoints with the label my.corp.foo then the label my.corp.foo=bar will not match the selector.
A label can be derived from various sources. For example, an endpoint will derive the labels associated with the container by the local container runtime as well as the labels associated with the pod as provided by Kubernetes. As these two label namespaces are not aware of each other, this may result in conflicting label keys.
To resolve this potential conflict, Cilium prefixes all label keys with source: to indicate the source of the label when importing labels, e.g. k8s:role=backend. This means that when you run a Docker container using docker run [...] -l foo=bar, the label container:foo=bar will appear on the Cilium endpoint representing the container. Similarly, a Kubernetes pod started with the label foo: bar will be represented with a Cilium endpoint associated with the label k8s:foo=bar. A unique name is allocated for each potential source. The following label sources are currently supported:
- container: for labels derived from the local container runtime
- k8s: for labels derived from Kubernetes
- mesos: for labels derived from Mesos
- reserved: for special reserved labels, see Special Identities
- unspec: for labels with unspecified source
When using labels to identify other resources, the source can be included to limit matching of labels to a particular type. If no source is provided, the label source defaults to any: which will match all labels regardless of their source. If a source is provided, the sources of the selecting and matching labels must match.
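The parsing and matching rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cilium's implementation: the `Label` type and the `parse_label` and `selects` helpers are hypothetical names, and the rule that a leading `source:` is recognized only when the part before the colon contains no `=` is an assumption made to keep the sketch short.

```python
from typing import NamedTuple, Optional

ANY_SOURCE = "any"  # selectors without a source match labels from any source


class Label(NamedTuple):
    source: str
    key: str
    value: Optional[str]  # None when the value portion is omitted


def parse_label(text: str, default_source: str = ANY_SOURCE) -> Label:
    """Parse 'source:key=value'; the source and the value are optional."""
    source = default_source
    head, sep, rest = text.partition(":")
    # Assumption: the part before ':' is a source only if it has no '='.
    if sep and "=" not in head:
        source, text = head, rest
    key, eq, value = text.partition("=")
    if not key:
        raise ValueError("the key portion of a label is mandatory")
    return Label(source, key, value if eq else None)


def selects(selector: Label, label: Label) -> bool:
    """Return True if the selector matches the given endpoint label."""
    if selector.source != ANY_SOURCE and selector.source != label.source:
        return False
    # Both the key and the value must match.
    return selector.key == label.key and selector.value == label.value


endpoint_label = parse_label("container:foo=bar")
print(selects(parse_label("foo=bar"), endpoint_label))      # source defaults to any
print(selects(parse_label("k8s:foo=bar"), endpoint_label))  # sources differ
```

A selector without a value, such as my.corp.foo, does not match my.corp.foo=bar in this sketch because the missing value is compared as-is rather than treated as a wildcard.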
Cilium makes application containers available on the network by assigning them IP addresses. Multiple application containers can share the same IP address; a typical example for this model is a Kubernetes Pod. All application containers which share a common address are grouped together in what Cilium refers to as an endpoint.
Allocating individual IP addresses enables the use of the entire Layer 4 port range by each endpoint. This essentially allows multiple application containers running on the same cluster node to all bind to the same well-known ports without causing any conflicts.
The default behavior of Cilium is to assign both an IPv6 and IPv4 address to
every endpoint. However, this behavior can be configured to only allocate an
IPv6 address with the
--enable-ipv4=false option. If both an IPv6 and IPv4
address are assigned, either address can be used to reach the endpoint. The
same behavior will apply with regard to policy rules, load-balancing, etc. See
Address Management for more details.
For identification purposes, Cilium assigns an internal endpoint id to all endpoints on a cluster node. The endpoint id is unique within the context of an individual cluster node.
An endpoint automatically derives metadata from the application containers associated with the endpoint. The metadata can then be used to identify the endpoint for security/policy, load-balancing and routing purposes.
The source of the metadata will depend on the orchestration system and container runtime in use. The following metadata retrieval mechanisms are currently supported:
|Kubernetes||Pod labels (via k8s API)|
|Mesos||Labels (via CNI)|
|containerd (Docker)||Container labels (via Docker API)|
Metadata is attached to endpoints in the form of Labels.
The following example launches a container with the label app=benchmark which is then associated with the endpoint. The label is prefixed with container: to indicate that the label was derived from the container runtime.
$ docker run --net cilium -d -l app=benchmark tgraf/netperf
aaff7190f47d071325e7af06577f672beff64ccc91d2b53c42262635c063cf1c
$ cilium endpoint list
ENDPOINT   POLICY        IDENTITY   LABELS (source:key[=value])   IPv6                   IPv4            STATUS
           ENFORCEMENT
62006      Disabled      257        container:app=benchmark       f00d::a00:20f:0:f236   10.15.116.202   ready
An endpoint can have metadata associated from multiple sources. A typical example is a Kubernetes cluster which uses containerd as the container runtime. Endpoints will derive Kubernetes pod labels (prefixed with the k8s: source prefix) and containerd labels (prefixed with the container: source prefix).
All endpoints are assigned an identity. The identity is what is used to enforce basic connectivity between endpoints. In traditional networking terminology, this would be equivalent to Layer 3 enforcement.
An identity is identified by Labels and is given a cluster wide unique identifier. The endpoint is assigned the identity which matches the endpoint’s Security Relevant Labels, i.e. all endpoints which share the same set of Security Relevant Labels will share the same identity. This concept allows policy enforcement to scale to a massive number of endpoints, as many individual endpoints will typically share the same set of security Labels as applications are scaled.
What is an Identity?
The identity of an endpoint is derived from the Labels associated with the pod or container, which are passed on to the endpoint. When a pod or container is started, Cilium will create an endpoint based on the event received from the container runtime to represent the pod or container on the network. As a next step, Cilium will resolve the identity of the endpoint created. Whenever the Labels of the pod or container change, the identity is reconfirmed and automatically modified as required.
Security Relevant Labels
Not all Labels associated with a container or pod are meaningful when deriving the Identity. Labels may be used to store metadata such as the timestamp when a container was launched. Cilium needs to know which labels are meaningful and should be considered when deriving the identity. For this purpose, the user is required to specify a list of string prefixes of meaningful labels. The standard behavior is to include all labels which start with the prefix id., e.g. id.groupA.service44. The list of meaningful label prefixes can be specified when starting the agent.
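As a rough illustration of prefix-based filtering, the following Python sketch keeps only the labels that start with a configured prefix. The helper name `security_relevant`, the `launched-at` label, and the use of `id.` as the prefix (taken from the example label id.groupA.service44) are assumptions for this example, not Cilium's actual code or configuration.

```python
from typing import FrozenSet, Iterable


def security_relevant(labels: Iterable[str], prefixes: Iterable[str]) -> FrozenSet[str]:
    """Keep only the labels that start with one of the meaningful prefixes."""
    wanted = tuple(prefixes)
    return frozenset(label for label in labels if label.startswith(wanted))


# Two pods of the same service, launched at different times (hypothetical labels).
pod_a = ["id.groupA.service44", "launched-at=2017-03-01T10:00:00Z"]
pod_b = ["id.groupA.service44", "launched-at=2017-03-02T08:30:00Z"]

# The timestamp label is ignored, so both pods yield the same label set
# and would therefore resolve to the same identity.
print(security_relevant(pod_a, ["id."]) == security_relevant(pod_b, ["id."]))
```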
All endpoints which are managed by Cilium will be assigned an identity. In order to allow communication to network endpoints which are not managed by Cilium, special identities exist to represent those. Special reserved identities are prefixed with the string reserved:.
|reserved:unknown||The identity could not be derived.|
|reserved:host||The collection of all cluster hosts. Any traffic that originates from or is destined to one of the IPs of any host in the cluster is assigned the reserved:host identity.|
|reserved:world||Any network endpoint outside of the cluster|
|reserved:health||This is health checking traffic generated by Cilium agents.|
An endpoint for which the identity has not yet been resolved is assigned the init identity. This represents the phase of an endpoint in which some of the metadata required to derive the security identity is still missing. This is typically the case in the bootstrapping phase.
The init identity is only allocated if the labels of the endpoint are not known at creation time. This can be the case for the Docker plugin.
|reserved:unmanaged||An endpoint that is not managed by Cilium, e.g. a Kubernetes pod that was launched before Cilium was installed.|
The following is a list of well-known identities which Cilium is aware of automatically and for which it will hand out a security identity without having to contact any external dependencies such as the kvstore. The purpose of this is to allow bootstrapping Cilium and to enable network connectivity with policy enforcement in the cluster for essential services without relying on external dependencies.
|Deployment||Namespace||ServiceAccount||Cluster Name||Numeric ID||Labels|
cilium-cluster is not defined with the
the default value will be set to “
Identity Management in the Cluster
Identities are valid in the entire cluster which means that if several pods or containers are started on several cluster nodes, all of them will resolve and share a single identity if they share the identity relevant labels. This requires coordination between cluster nodes.
The operation to resolve an endpoint identity is performed with the help of the distributed key-value store, which supports atomic operations of the form "generate a new unique identifier if the following value has not been seen before". This allows each cluster node to create the identity relevant subset of labels and then query the key-value store to derive the identity. Depending on whether the set of labels has been queried before, either a new identity will be created, or the identity of the initial query will be returned.
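The "get or create" behavior described above can be sketched with an in-memory stand-in for the key-value store. This is a simplified model, not how Cilium talks to etcd or consul: the `IdentityAllocator` class is hypothetical, the lock merely models the atomicity that the real store provides, and the starting identifier 256 is an arbitrary assumption.

```python
import threading
from typing import Dict, FrozenSet, Iterable


class IdentityAllocator:
    """In-memory stand-in for the key-value store's atomic get-or-create."""

    def __init__(self, first_id: int = 256) -> None:
        self._lock = threading.Lock()
        self._ids: Dict[FrozenSet[str], int] = {}
        self._next_id = first_id

    def resolve(self, labels: Iterable[str]) -> int:
        """Return the identity for a label set, allocating one if unseen."""
        key = frozenset(labels)  # the order of labels must not matter
        with self._lock:  # models the atomic operation of the store
            if key not in self._ids:
                self._ids[key] = self._next_id
                self._next_id += 1
            return self._ids[key]


store = IdentityAllocator()
# Two "nodes" resolving the same identity relevant labels get the same identity.
node1 = store.resolve(["k8s:role=frontend", "k8s:app=web"])
node2 = store.resolve(["k8s:app=web", "k8s:role=frontend"])
other = store.resolve(["k8s:role=backend"])
print(node1 == node2, node1 != other)
```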
Cilium refers to a node as an individual member of a cluster. Each node must be running the cilium-agent and will operate in a mostly autonomous manner. Synchronization of state between Cilium agents running on different nodes is kept to a minimum for simplicity and scale. It occurs exclusively via the Key-Value store or with packet metadata.
Cilium will automatically detect the node’s IPv4 and IPv6 address. The detected node address is printed out when the cilium-agent starts:
Local node-name: worker0
Node-IPv6: f00d::ac10:14:0:1
External-Node IPv4: 172.16.0.20
Internal-Node IPv4: 10.200.28.238
The address management is designed with simplicity and resilience in mind. This is achieved by delegating the address allocation for endpoints to each individual node in the cluster. Each cluster node is assigned a node address allocation prefix out of an overarching cluster address prefix and will allocate IPs for endpoints independently.
This simplifies address handling and allows one to make a fundamental assumption:
- No state needs to be synchronized between cluster nodes to allocate IP addresses and to determine whether an IP address belongs to an endpoint of the cluster and whether that endpoint resides on the local cluster node.
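To make the assumption above concrete, the following Python sketch uses the standard `ipaddress` module to carve node allocation prefixes out of a cluster prefix and to classify an address without consulting any other node. The cluster prefix 10.0.0.0/8, the /16 node prefix length, and the helper names are hypothetical values chosen for illustration only.

```python
import ipaddress

# Hypothetical values; adjust to the prefixes used in your environment.
CLUSTER_PREFIX = ipaddress.ip_network("10.0.0.0/8")
NODE_PREFIX_LEN = 16


def node_prefix(index: int) -> ipaddress.IPv4Network:
    """Return the index-th node allocation prefix within the cluster prefix."""
    subnets = list(CLUSTER_PREFIX.subnets(new_prefix=NODE_PREFIX_LEN))
    return subnets[index]


def classify(ip: str, local: ipaddress.IPv4Network) -> str:
    """Decide where an address lives using only locally known prefixes."""
    addr = ipaddress.ip_address(ip)
    if addr in local:
        return "endpoint on this node"
    if addr in CLUSTER_PREFIX:
        return "endpoint on another node"
    return "outside the cluster"


local = node_prefix(1)  # this node allocates endpoint IPs from 10.1.0.0/16
print(local)
print(classify("10.1.0.5", local))
print(classify("10.2.0.5", local))
print(classify("192.0.2.1", local))
```

Because each node owns a disjoint prefix, the membership checks above are pure prefix comparisons and need no shared state.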
If you are using Kubernetes, the allocation of the node address prefix can be simply delegated to Kubernetes by passing the --allocate-node-cidrs flag to kube-controller-manager. Cilium will then automatically use the IPv4 node CIDR allocated by Kubernetes.
The following values are used by default if the cluster prefix is left unspecified. These are meant for testing and need to be adjusted according to the needs of your environment.
Note: Only 16 bits out of the
The size of the IPv4 cluster prefix can be changed with the --ipv4-cluster-cidr-mask-size option. The size of the IPv6 cluster prefix is currently fixed at /48. The node allocation prefixes can be
specified manually with the option
Multi Host Networking
Cilium is in full control over both ends of the connection for connections inside the cluster. It can thus transmit state and security context information between two container hosts by embedding the information in encapsulation headers or even unused bits of the IPv6 packet header. This allows Cilium to transmit the security context of where the packet originates, which allows tracing back which container labels are assigned to the origin container.
As the packet headers contain security sensitive information, it is highly recommended to either encrypt all traffic or run Cilium in a trusted network environment.
Cilium keeps the networking concept as simple as possible. There are two networking models to choose from.
Regardless of the option chosen, the container itself has no awareness of the underlying network it runs on; it only contains a default route which points to the IP address of the cluster node. Given the removal of the routing cache in the Linux kernel, this reduces the amount of state to keep in the per connection flow cache (TCP metrics), which allows millions of connections to be terminated in each container.
Overlay Network Mode
When no configuration is provided, Cilium automatically runs in this mode.
In this mode, all cluster nodes form a mesh of tunnels using a UDP-based encapsulation protocol such as Geneve. All container-to-container network traffic is routed through these tunnels. This mode has several major advantages:
- Simplicity: The network which connects the cluster nodes does not need to be made aware of the cluster prefix. Cluster nodes can span multiple routing or link-layer domains. The topology of the underlying network is irrelevant as long as cluster nodes can reach each other using IP/UDP.
- Auto-configuration: When running together with an orchestration system such as Kubernetes, the list of all nodes in the cluster including their associated node allocation prefix is made available to each agent automatically. This means that if Kubernetes is being run with the --allocate-node-cidrs option, Cilium can form an overlay network automatically without any configuration by the user. New nodes joining the cluster will automatically be incorporated into the mesh.
- Identity transfer: Encapsulation protocols allow for the carrying of arbitrary metadata along with the network packet. Cilium makes use of this ability to transfer metadata such as the source security identity and load balancing state to perform direct-server-return.
Direct / Native Routing Mode
This is an advanced networking mode which requires the underlying
network to be made aware of container IPs. You can enable this mode
by running Cilium with the option
In direct routing mode, Cilium will hand all packets which are not addressed to another local endpoint to the routing subsystem of the Linux kernel. This means that the packet will be routed as if a local process had emitted the packet. As a result, the network connecting the cluster nodes must be aware that each of the node IP prefixes is reachable by using the node’s primary IP address as an L3 next hop address.
Cilium automatically enables IP forwarding in Linux when direct mode is configured, but it is up to the container cluster administrator to ensure that each routing element in the underlying network has a route that describes each node IP as the IP next hop for the corresponding node prefix.
This is typically achieved using two methods:
- Operation of a routing protocol such as OSPF or BGP via a routing daemon such as Zebra, bird, or bgpd. The routing protocols will announce the node allocation prefix via the node’s IP to all other nodes.
- Use of the cloud provider’s routing functionality. Refer to the documentation of your cloud provider for additional details (e.g., AWS VPC Route Tables or GCE Routes). These APIs can be used to associate each node prefix with the appropriate next hop IP each time a container node is added to the cluster. If you are running Kubernetes with the --cloud-provider option in combination with the --allocate-node-cidrs option then this is configured automatically for IPv4 prefixes.
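Whichever method is used, the resulting state in the underlying network is a set of routes that map each node prefix to that node's IP as the next hop. The Python sketch below models such a table with a longest-prefix-match lookup. The prefixes, node IPs, and the `next_hop` helper are hypothetical and only illustrate the routing requirement; they are not part of Cilium.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical node allocation prefixes mapped to the owning node's IP.
routes: Dict[ipaddress.IPv4Network, str] = {
    ipaddress.ip_network("10.1.0.0/16"): "192.168.0.11",
    ipaddress.ip_network("10.2.0.0/16"): "192.168.0.12",
}


def next_hop(dst: str) -> Optional[str]:
    """Return the next hop for dst using longest prefix match, if any."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None  # no route: the network does not know this container IP
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


print(next_hop("10.2.3.4"))   # routed via the node owning 10.2.0.0/16
print(next_hop("10.9.0.1"))   # no node prefix covers this address
```

A missing entry in this table corresponds to the failure mode of direct routing mode: packets for that node's endpoints cannot be delivered until the route is announced or configured.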
Use of direct routing mode with advanced policy use cases such as L7 policies is currently beta. Please provide feedback and file a GitHub issue if you experience any problems.
There are two possible approaches to performing network forwarding for container-to-container traffic:
Cluster mesh extends the networking datapath across multiple clusters. It allows endpoints in all connected clusters to communicate while providing full policy enforcement. Load-balancing is available via Kubernetes annotations.
See Setting up Cluster Mesh for instructions on how to set up cluster mesh.
Container Communication with External Hosts
Container communication with the outside world has two primary modes:
- Containers exposing API services for consumption by hosts outside of the container cluster.
- Containers making outgoing connections. Examples include connecting to 3rd-party API services like Twilio or Stripe as well as accessing private APIs that are hosted elsewhere in your enterprise datacenter or cloud deployment.
In the Direct / Native Routing Mode described before, if container IP addresses are routable outside of the container cluster, communication with external hosts requires little more than enabling L3 forwarding on each of the Linux nodes.
External Network Connectivity
If the destination of a packet lies outside of the cluster, Cilium will delegate routing to the routing subsystem of the cluster node to use the default route which is installed on the node of the cluster.
As the IP addresses used for the cluster prefix are typically allocated
from RFC1918 private address blocks and are not publicly routable, Cilium will
automatically masquerade the source IP address of all traffic that is leaving
the cluster. This behavior can be disabled by running
Public Endpoint Exposure
In direct routing mode, endpoint IPs can be publicly routable IPs and no additional action needs to be taken.
In overlay mode, endpoints that are accepting inbound connections from cluster external clients likely want to be exposed via some kind of load-balancing layer. Such a load-balancer will have a public external address that is not part of the Cilium network. This can be achieved by having a load-balancer container that has both a public IP on an externally reachable network and a private IP on a Cilium network. However, many container orchestration frameworks, like Kubernetes, have built in abstractions to handle this “ingress” load-balancing capability, which achieve the same effect, so that Cilium handles forwarding and security only for “internal” traffic between different services.
Cilium provides security on multiple levels. Each can be used individually or combined together.
- Identity based Connectivity Access Control: Connectivity policies between endpoints (Layer 3), e.g. any endpoint with label role=frontend can connect to any endpoint with label role=backend.
- Restriction of accessible ports (Layer 4) for both incoming and outgoing connections, e.g. an endpoint with label role=frontend can only make outgoing connections on port 443 (https) and an endpoint with label role=backend can only accept connections on port 443 (https).
- Fine grained access control on application protocol level to secure HTTP and remote procedure call (RPC) protocols, e.g. the endpoint with label role=frontend can only perform the REST API call GET /userdata/[0-9]+, all other API interactions with role=backend are restricted.
Currently on the roadmap, to be added soon:
- Authentication: Any endpoint which wants to initiate a connection to an endpoint with the label role=backend must have a particular security certificate to authenticate itself before being able to initiate any connections. See GH issue 502 for additional details.
- Encryption: Communication between any endpoint with the label role=frontend to any endpoint with the label role=backend is automatically encrypted with a key that is automatically rotated. See GH issue 504 to track progress on this feature.
Identity based Connectivity Access Control
Container management systems such as Kubernetes deploy a networking model which assigns an individual IP address to each pod (group of containers). This ensures simplicity in architecture, avoids unnecessary network address translation (NAT) and provides each individual container with a full range of port numbers to use. The logical consequence of this model is that depending on the size of the cluster and total number of pods, the networking layer has to manage a large number of IP addresses.
Traditionally security enforcement architectures have been based on IP address
filters. Let’s walk through a simple example: If all pods with the label
role=frontend should be allowed to initiate connections to all pods with
role=backend then each cluster node which runs at least one pod
with the label
role=backend must have a corresponding filter installed
which allows all IP addresses of all
role=frontend pods to initiate a
connection to the IP addresses of all local
role=backend pods. All other
connection requests should be denied. This could look like this: If the
destination address is 10.1.1.2 then allow the connection only if the source
address is one of the following [10.1.2.2,10.1.2.3,188.8.131.52].
Every time a new pod with the label role=frontend or role=backend is either started or stopped, the rules on every cluster node which runs any such pods must be updated by either adding or removing the corresponding IP address from the list of allowed IP addresses. In large distributed applications, this could imply updating thousands of cluster nodes multiple times per second depending on the churn rate of deployed pods. Worse, the starting of new role=frontend pods must be delayed until all servers running role=backend pods have been updated with the new security rules as otherwise connection attempts from the new pod could be mistakenly dropped. This makes it difficult to scale efficiently.
In order to avoid these complications which can limit scalability and flexibility, Cilium entirely separates security from network addressing. Instead, security is based on the identity of a pod, which is derived through labels. This identity can be shared between pods. This means that when the first role=frontend pod is started, Cilium assigns an identity to that pod which is then allowed to initiate connections to the identity of the role=backend pod. The subsequent start of additional role=frontend pods only requires resolving this identity via a key-value store; no action has to be performed on any of the cluster nodes hosting role=backend pods. The starting of a new pod must only be delayed until the identity of the pod has been resolved, which is a much simpler operation than updating the security rules on all other cluster nodes.
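The difference in update cost between the two approaches can be illustrated with a small Python model. Everything here is a simplified sketch under stated assumptions: the `IpFilterNode` and `IdentityNode` classes, the `updates` counters, and the use of the label string itself as the identity are hypothetical and only mirror the reasoning above.

```python
from typing import List, Set, Tuple


class IpFilterNode:
    """Node hosting role=backend pods that filters on source IP addresses."""

    def __init__(self) -> None:
        self.allowed_sources: Set[str] = set()
        self.updates = 0  # number of rule changes pushed to this node

    def add_source(self, ip: str) -> None:
        self.allowed_sources.add(ip)
        self.updates += 1

    def allows(self, src_ip: str) -> bool:
        return src_ip in self.allowed_sources


class IdentityNode:
    """Node hosting role=backend pods that filters on the sender's identity."""

    def __init__(self, allowed: Set[Tuple[str, str]]) -> None:
        self.allowed = allowed  # (source identity, destination identity)
        self.updates = 0        # stays 0: pod churn does not touch the rules

    def allows(self, src_identity: str) -> bool:
        return (src_identity, "role=backend") in self.allowed


def start_frontend_pods(ips: List[str], nodes: List[IpFilterNode]) -> None:
    """With IP filters, each new pod IP must be pushed to every backend node."""
    for ip in ips:
        for node in nodes:
            node.add_source(ip)


ip_nodes = [IpFilterNode() for _ in range(3)]
start_frontend_pods(["10.1.2.2", "10.1.2.3"], ip_nodes)
print(sum(node.updates for node in ip_nodes))  # 2 pods x 3 nodes rule updates

id_node = IdentityNode({("role=frontend", "role=backend")})
print(id_node.allows("role=frontend"), id_node.updates)
```

In the identity-based model the only per-pod work is resolving the new pod's identity, which happens on the node where the pod starts.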
All security policies are described assuming stateful policy enforcement for session based protocols. This means that the intent of the policy is to describe allowed direction of connection establishment. If the policy allows A => B then reply packets from B to A are automatically allowed as well. However, B is not automatically allowed to initiate connections to A. If that outcome is desired, then both directions must be explicitly allowed.
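A minimal Python sketch of this stateful behavior follows. It is an illustration only: the `StatefulPolicy` class is hypothetical, and the `established` set is a stand-in for a real connection tracking table, which would also track ports, protocols, and timeouts.

```python
from typing import Set, Tuple

Flow = Tuple[str, str]  # (source, destination)


class StatefulPolicy:
    """Policy that describes allowed directions of connection establishment."""

    def __init__(self, allowed: Set[Flow]) -> None:
        self.allowed = allowed
        self.established: Set[Flow] = set()  # stand-in for connection tracking

    def connect(self, src: str, dst: str) -> bool:
        """Handle a new connection attempt from src to dst."""
        if (src, dst) not in self.allowed:
            return False
        self.established.add((src, dst))
        return True

    def reply(self, src: str, dst: str) -> bool:
        """Allow a reply from src to dst only for a connection dst => src."""
        return (dst, src) in self.established


policy = StatefulPolicy({("A", "B")})
print(policy.connect("A", "B"))  # A => B is allowed by policy
print(policy.reply("B", "A"))    # replies on that connection are allowed
print(policy.connect("B", "A"))  # B may not initiate connections to A
```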
Security policies may be enforced at ingress or egress. For ingress, this means that each cluster node verifies all incoming packets and determines whether the packet is allowed to be transmitted to the intended endpoint. Correspondingly, for egress each cluster node verifies outgoing packets and determines whether the packet is allowed to be transmitted to its intended destination.
In order to enforce identity based security in a multi host cluster, the identity of the transmitting endpoint is embedded into every network packet that is transmitted in between cluster nodes. The receiving cluster node can then extract the identity and verify whether a particular identity is allowed to communicate with any of the local endpoints.
Default Security Policy
If no policy is loaded, the default behavior is to allow all communication unless policy enforcement has been explicitly enabled. As soon as the first policy rule is loaded, policy enforcement is enabled automatically and any communication must then be whitelisted or the relevant packets will be dropped.
Similarly, if an endpoint is not subject to an L4 policy, communication from and to all ports is permitted. Associating at least one L4 policy to an endpoint will block all connectivity to ports unless explicitly allowed.
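The default behavior described in this section reduces to two small decisions, sketched below in Python. The function names and the convention of passing `None` when no L4 policy applies are assumptions made for this example, not Cilium's API.

```python
from typing import Optional, Set


def l3_allowed(policy_enforced: bool, whitelisted: bool) -> bool:
    """Without enforcement everything is allowed; with it, only whitelisted traffic."""
    return whitelisted or not policy_enforced


def l4_allowed(port: int, allowed_ports: Optional[Set[int]]) -> bool:
    """allowed_ports is None when the endpoint is not subject to any L4 policy."""
    return allowed_ports is None or port in allowed_ports


print(l3_allowed(policy_enforced=False, whitelisted=False))  # no policy loaded
print(l3_allowed(policy_enforced=True, whitelisted=False))   # dropped
print(l4_allowed(8080, None))    # no L4 policy: all ports permitted
print(l4_allowed(8080, {443}))   # L4 policy present: only port 443 permitted
```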
Orchestration System Specifics
Cilium regards each deployed Pod as an endpoint with regard to networking and security policy enforcement. Labels associated with pods can be used to define the identity of the endpoint.
When two pods communicate via a service construct, then the labels of the origin pod apply to determine the identity.