Architecture Guide

The goal of this document is to describe the components of the Cilium architecture, and the different models for deploying Cilium within your datacenter or cloud environment. It focuses on the higher-level understanding required to run a full Cilium deployment. You can then use the more detailed Administrator Guide to understand the details of setting up Cilium.

Cilium Components

[Figure: Overview of the Cilium components (cilium-arch.png)]

A deployment of Cilium consists of the following components running on each Linux container node in the container cluster:

  • Cilium Agent (Daemon): Userspace daemon that interacts with the container runtime and orchestration systems such as Kubernetes via Plugins to set up networking and security for containers running on the local server. Provides an API for configuring network security policies, extracting network visibility data, etc.
  • Cilium CLI Client: Simple CLI client for communicating with the local Cilium Agent, for example, to configure network security or visibility policies.
  • Linux Kernel BPF: Integrated capability of the Linux kernel to accept compiled bytecode that is run at various hook / trace points within the kernel. Cilium compiles BPF programs and has the kernel run them at key points in the network stack to have visibility and control over all network traffic in / out of all containers.
  • Container Platform Network Plugin: Each container platform (e.g., Docker, Kubernetes) has its own plugin model for how external networking platforms integrate. In the case of Docker, each Linux node runs a process (cilium-docker) that handles each Docker libnetwork call and passes data / requests on to the main Cilium Agent.

In addition to the components that run on each Linux container host, Cilium leverages a key-value store to share data between Cilium Agents running on different nodes. The currently supported key-value stores are:

  • etcd
  • consul
  • local storage (golang hashmap)

Cilium Agent

The Cilium agent (cilium-agent) runs on each Linux container host. At a high level, the agent accepts configuration that describes service-level network security and visibility policies. It then listens to events in the container runtime to learn when containers are started or stopped, and it creates custom BPF programs which the Linux kernel uses to control all network access in / out of those containers. In more detail, the agent:

  • Exposes APIs to allow operations / security teams to configure security policies (see below) that control all communication between containers in the cluster. These APIs also expose monitoring capabilities to gain additional visibility into network forwarding and filtering behavior.
  • Gathers metadata about each new container that is created. In particular, it queries identity metadata like container / pod labels, which are used to identify endpoints in Cilium security policies.
  • Interacts with the container platform’s network plugin to perform IP address management (IPAM), which controls what IPv4 and IPv6 addresses are assigned to each container. The agent manages IPAM as a single pool shared between all plugins, which means that the Docker and CNI network plugins can run side by side, allocating from a single address pool.
  • Combines its knowledge about container identity and addresses with the already configured security and visibility policies to generate highly efficient BPF programs that are tailored to the network forwarding and security behavior appropriate for each container.
  • Compiles the BPF programs to bytecode using clang/LLVM and passes them to the Linux kernel to run for all packets in / out of the container’s virtual ethernet device(s).

Cilium CLI Client

The Cilium CLI Client (cilium) is a command-line tool that is installed along with the Cilium Agent. It provides an interface to interact with all aspects of the Cilium Agent API. This includes inspecting Cilium’s state about each network endpoint (i.e., container), configuring and viewing security policies, and configuring network monitoring behavior.
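For illustration, a few typical invocations are sketched below. The endpoint list and monitor commands are referenced elsewhere in this guide; the policy import subcommand and the rules.json file name are assumptions, since the exact subcommand set depends on the installed Cilium version.

$ cilium endpoint list                # show all local endpoints, their labels and addresses
$ cilium policy import rules.json     # load a policy file into the local agent
$ cilium monitor                      # stream datapath events such as policy-denied drops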

Linux Kernel BPF

Berkeley Packet Filter (BPF) is a Linux kernel bytecode interpreter originally introduced to filter network packets, e.g. for tcpdump and socket filters. It has since been extended with additional data structures such as hash tables and arrays as well as additional actions to support packet mangling, forwarding, encapsulation, etc. An in-kernel verifier ensures that BPF programs are safe to run and a JIT compiler converts the bytecode to CPU architecture specific instructions for native execution efficiency. BPF programs can be run at various hooking points in the kernel such as for incoming packets, outgoing packets, system calls, kprobes, etc.

BPF continues to evolve and gain additional capabilities with each new Linux release. Cilium leverages BPF to perform core datapath filtering, mangling, monitoring and redirection, and requires BPF capabilities that are present in any Linux kernel version 4.8.0 or newer. Because 4.8.x has already been declared end of life and 4.9.x has been nominated as a stable release, we recommend running at least kernel 4.9.17 (the latest stable Linux kernel as of this writing is 4.10.x).

Cilium is capable of probing the Linux kernel for available features and will automatically make use of more recent features as they are detected.

Linux distros that focus on being a container runtime (e.g., CoreOS, Fedora Atomic) typically already ship kernels that are newer than 4.8, but even recent versions of general purpose operating systems such as Ubuntu 16.10 ship fairly recent kernels. Some Linux distributions still ship older kernels but many of them allow installing recent kernels from separate kernel package repositories.

For more detail on kernel versions, see: Linux Kernel.

Key-Value Store

The Key-Value (KV) Store is used for the following state:

  • Policy Identities: list of labels <=> policy identity identifier
  • Global Services: global service id to VIP association (optional)
  • Encapsulation VTEP mapping (optional)

To simplify things in a larger deployment, the key-value store can be the same one used by the container orchestrator (e.g., Kubernetes using etcd). In single node Cilium deployments used for basic testing / learning, Cilium can use a local store implemented as a golang hash map, avoiding the need to set up a dedicated KV store.

Labels

Labels are a generic, flexible and highly scalable way of addressing a large set of resources as they allow for arbitrary grouping and creation of sets. Whenever something needs to be described, addressed or selected, this is done based on labels:

  • Endpoints are assigned labels as derived from container runtime or the orchestration system.
  • Network policies select endpoints based on labels and allow consumers based on labels.
  • Network policies themselves are described and addressed by labels.

A label is a pair of strings consisting of a key and value. A label can be formatted as a single string with the format key=value. The key portion is mandatory and must be unique. This is typically achieved by using the reverse domain name notation, e.g. io.cilium.mykey=myvalue. The value portion is optional and can be omitted, e.g. io.cilium.mykey.

Key names should typically consist of the character set [a-z0-9-.].

When using labels to select resources, both the key and the value must match. For example, when a policy should be applied to all endpoints with the label my.corp.foo, then the label my.corp.foo=bar will not match the selector.

A label can be derived from various sources. For example, a Cilium endpoint will derive the labels associated to the container by the local container runtime as well as the labels associated with the pod as provided by Kubernetes. As these two label namespaces are not aware of each other, this may result in conflicting label keys.

To resolve this potential conflict, Cilium prefixes all label keys with source: to indicate the source of the label when importing labels, e.g. k8s:role=frontend, container:user=joe, k8s:role=backend. This means that when you run a Docker container using docker run [...] -l foo=bar, the label container:foo=bar will appear on the Cilium endpoint representing the container. Similarly, a Kubernetes pod started with the label foo: bar will be represented with a Cilium endpoint associated with the label k8s:foo=bar. A unique name is allocated for each potential source. The following label sources are currently supported:

  • container: for labels derived from the local container runtime
  • k8s: for labels derived from Kubernetes
  • reserved: for special reserved labels, see Special Identities.
  • unspec: for labels with unspecified source

When using labels to identify other resources, the source can be included to limit matching of labels to a particular type. If no source is provided, the label source defaults to any: which will match all labels regardless of their source. If a source is provided, the source of the selecting and matching labels need to match.
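As an illustration of source-aware selection, the hypothetical short-form allow rule below (the rule syntax itself is described in the Policy Language section) only admits frontend endpoints whose role label was derived from Kubernetes; writing role=frontend without a prefix would instead behave like any:role=frontend and match the label regardless of its source:

[{
        "coverage": ["role=backend"],
        "allow": ["k8s:role=frontend"]
}]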

Endpoints

Cilium makes application containers available on the network by assigning them IP addresses. Multiple application containers can share the same IP address; a typical example for this model is a Kubernetes Pod. All application containers which share a common address are grouped together in what Cilium refers to as an endpoint.

Allocating individual IP addresses enables the use of the entire Layer 4 port range by each endpoint. This essentially allows multiple application containers running on the same cluster node to all bind to well known ports such as 80 without causing any conflicts.

The default behaviour of Cilium is to assign both an IPv6 and IPv4 address to every endpoint. However, this behaviour can be configured to only allocate an IPv6 address with the --disable-ipv4 option. If both an IPv6 and IPv4 address are assigned, either address can be used to reach the endpoint. The same behaviour will apply with regard to policy rules, load-balancing, etc. See address management for more details.
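As a minimal sketch, an IPv6-only node could therefore be brought up along the following lines (all other required agent options omitted for brevity):

$ cilium-agent --disable-ipv4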

Identification

For identification purposes, Cilium assigns an endpoint id to all endpoints on a cluster node. The endpoint id is unique within the context of an individual cluster node.

Endpoint Metadata (Labels)

An endpoint automatically derives metadata from the application containers associated with the endpoint. The metadata can then be used to identify the endpoint for security/policy, load-balancing and routing purposes.

The source of the metadata will depend on the orchestration system and container runtime in use. The following metadata retrieval mechanisms are currently supported:

System                Description
Kubernetes            Pod labels (via k8s API)
Mesos                 Labels (via CNI)
containerd (Docker)   Container labels (via Docker API)

Metadata is attached to endpoints in the form of Labels.

The following example launches a container with the label app=benchmark which is then associated with the endpoint. The label is prefixed with container: to indicate that the label was derived from the container runtime.

$ docker run --net cilium -d -l app=benchmark tgraf/netperf
aaff7190f47d071325e7af06577f672beff64ccc91d2b53c42262635c063cf1c
$  cilium endpoint list
ENDPOINT   POLICY        IDENTITY   LABELS (source:key[=value])   IPv6                   IPv4            STATUS
           ENFORCEMENT
62006      Disabled      257        container:app=benchmark       f00d::a00:20f:0:f236   10.15.116.202   ready

An endpoint can have metadata associated from multiple sources. A typical example is a Kubernetes cluster which uses containerd as the container runtime. Endpoints will derive Kubernetes pod labels (prefixed with the k8s: source prefix) and containerd labels (prefixed with container: source prefix).

Cluster Node

Cilium refers to a node as an individual member of a cluster. Each node must be running the cilium-agent and will operate in a mostly autonomous manner. Synchronization of state between Cilium agents running on different nodes is kept to a minimum for simplicity and scale. It occurs exclusively via the Key-Value store or with packet metadata.

Node Address

Cilium will automatically detect the node’s IPv4 and IPv6 address. The detected node address is printed out when the cilium-agent starts:

Local node-name: worker0
Node-IPv6: f00d::ac10:14:0:1
External-Node IPv4: 172.16.0.20
Internal-Node IPv4: 10.200.28.238

Address Management

Address management is designed with simplicity and resilience in mind. This is achieved by delegating the address allocation for endpoints to each individual node in the cluster. Each cluster node is assigned a node address allocation prefix out of an overarching cluster address prefix and will allocate IPs for endpoints independently.

This simplifies address handling and allows one to make a fundamental assumption:

  • No state needs to be synchronized between cluster nodes to allocate IP addresses and to determine whether an IP address belongs to an endpoint of the cluster and whether that endpoint resides on the local cluster node.

Note

If you are using Kubernetes, the allocation of the node address prefix can simply be delegated to Kubernetes by specifying the --allocate-node-cidrs flag to kube-controller-manager. Cilium will automatically use the IPv4 node CIDR allocated by Kubernetes.
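As a sketch, and assuming a cluster prefix of 10.0.0.0/8 as used in the defaults below, the relevant kube-controller-manager flags could look roughly like this (values are illustrative only, remaining flags omitted):

kube-controller-manager --allocate-node-cidrs --cluster-cidr=10.0.0.0/8 ...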

The following values are used by default if the cluster prefix is left unspecified. These are meant for testing and need to be adjusted according to the needs of your environment.

Type   Cluster      Node Prefix
IPv4   10.0.0.0/8   10.X.0.0/16 where X is derived using the last 8 bits of the first IPv4 address in the list of global scope addresses on the cluster node.
IPv6   f00d::/48    f00d:0:0:0:<ipv4-address>::/96 where the IPv4 address is the first address in the list of global scope addresses on the cluster node.

Note: Only 16 bits out of the /96 node prefix are currently used when allocating container addresses. This allows the remaining 16 bits to be used to store arbitrary connection state when sending packets between nodes. A typical use case for the state is direct server return.

The size of the IPv4 cluster prefix can be changed with the --ipv4-cluster-cidr-mask-size option. The size of the IPv6 cluster prefix is currently fixed at /48. The node allocation prefixes can be specified manually with the options --ipv4-range and --ipv6-range respectively.
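For illustration, a node could be started with manually chosen allocation prefixes roughly as follows; the prefix values are placeholders and must fall within the cluster prefixes described above:

$ cilium-agent --ipv4-range 10.21.0.0/16 --ipv6-range f00d:0:0:0:1::/96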

Multi Host Networking

Cilium is in full control over both ends of the connection for connections inside the cluster. It can thus transmit state and security context information between two container hosts by embedding the information in encapsulation headers or even unused bits of the IPv6 packet header. This allows Cilium to transmit the security context of where the packet originates, which allows tracing back which container labels are assigned to the origin container.

Note

As the packet headers contain security sensitive information, it is highly recommended to either encrypt all traffic or run Cilium in a trusted network environment.

Cilium keeps the networking concept as simple as possible. There are two networking models to choose from.

Regardless of the option chosen, the container itself has no awareness of the underlying network it runs on; it only contains a default route which points to the IP address of the cluster node. Given the removal of the routing cache in the Linux kernel, this reduces the amount of state to keep in the per connection flow cache (TCP metrics), which makes it possible to terminate millions of connections in each container.

Overlay Network Mode

When no configuration is provided, Cilium automatically runs in this mode.

In this mode, all cluster nodes form a mesh of tunnels using the UDP based encapsulation protocols VXLAN or Geneve. All container-to-container network traffic is routed through these tunnels. This mode has several major advantages:

  • Simplicity: The network which connects the cluster nodes does not need to be made aware of the cluster prefix. Cluster nodes can spawn multiple routing or link-layer domains. The topology of the underlying network is irrelevant as long as cluster nodes can reach each other using IP/UDP.
  • Auto-configuration: When running together with an orchestration system such as Kubernetes, the list of all nodes in the cluster including their associated node allocation prefix is made available to each agent automatically. This means that if Kubernetes is being run with the --allocate-node-cidrs option, Cilium can form an overlay network automatically without any configuration by the user. New nodes joining the cluster will automatically be incorporated into the mesh.
  • Identity transfer: Encapsulation protocols allow for the carrying of arbitrary metadata along with the network packet. Cilium makes use of this ability to transfer metadata such as the source security identity and load balancing state to perform direct-server-return.

Direct / Native Routing Mode

Note

This is an advanced networking mode which requires the underlying network to be made aware of container IPs. You can enable this mode by running Cilium with the option --tunnel disabled.

In direct routing mode, Cilium will hand all packets which are not addressed to another local endpoint to the routing subsystem of the Linux kernel. This means that the packet will be routed as if a local process had emitted the packet. As a result, the network connecting the cluster nodes must be aware that each node’s allocation prefix is reachable via that node’s primary IP address as an L3 next hop. This is typically achieved using one of two methods (a static-route sketch follows the list below):

  • Operation of a routing protocol such as OSPF or BGP via a routing daemon such as zebra, bird, or bgpd. The routing protocols will announce the node allocation prefix via the node’s IP to all other nodes.
  • Use of the cloud provider’s routing functionality. Refer to the documentation of your cloud provider for additional details. If you are running Kubernetes with the --cloud-provider option in combination with the --allocate-node-cidrs option then this is configured automatically for IPv4 prefixes.
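To make the requirement above concrete, the sketch below shows the kind of route every peer node (or the upstream router) ultimately needs for the node worker0 from the earlier Node Address example; the allocation prefixes are illustrative placeholders, and in practice these routes would normally be installed by the routing daemon or cloud provider rather than by hand:

$ ip route add 10.20.0.0/16 via 172.16.0.20                  # worker0's IPv4 node prefix via its external IPv4 address
$ ip -6 route add f00d:0:0:0:1::/96 via f00d::ac10:14:0:1    # worker0's IPv6 node prefix via its node IPv6 address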

Note

Use of direct routing mode currently only offers identity based security policy enforcement for IPv6, where the security identity is stored in the flow label. IPv4 is currently not supported and thus security must be enforced using CIDR policy rules.

In summary, there are two possible approaches to performing network forwarding for container-to-container traffic: the overlay network mode and the direct / native routing mode described above.

Container Communication with External Hosts

Container communication with the outside world has two primary modes:

  • Containers exposing API services for consumption by hosts outside of the container cluster.
  • Containers making outgoing connections. Examples include connecting to 3rd-party API services like Twilio or Stripe as well as accessing private APIs that are hosted elsewhere in your enterprise datacenter or cloud deployment.

In the Direct / Native Routing Mode described before, if container IP addresses are routable outside of the container cluster, communication with external hosts requires little more than enabling L3 forwarding on each of the Linux nodes.

External Network Connectivity

If the destination of a packet lies outside of the cluster, Cilium will delegate routing to the routing subsystem of the cluster node, which will use the default route installed on that node.

As the IP addresses used for the cluster prefix are typically allocated from RFC1918 private address blocks and are not publicly routable, Cilium will automatically masquerade the source IP address of all traffic that is leaving the cluster. This behaviour can be disabled by running cilium-agent with the option --masquerade=false.

Public Endpoint Exposure

In direct routing mode, endpoint IPs can be publicly routable IPs and no additional action needs to be taken.

In overlay mode, endpoints that are accepting inbound connections from cluster external clients likely want to be exposed via some kind of load-balancing layer. Such a load-balancer will have a public external address that is not part of the Cilium network. This can be achieved by having a load-balancer container that both has a public IP on an externally reachable network and a private IP on a Cilium network. However, many container orchestration frameworks, like Kubernetes, have built-in abstractions to handle this “ingress” load-balancing capability, which achieve the same effect: Cilium handles forwarding and security only for “internal” traffic between different services.

Security

Cilium provides security on multiple levels. Each can be used individually or combined together.

  • Identity based Connectivity Access Control: Connectivity policies between endpoints (Layer 3), e.g. any endpoint with label role=frontend can connect to any endpoint with label role=backend.
  • Restriction of accessible ports (Layer 4) for both incoming and outgoing connections, e.g. endpoint with label role=frontend can only make outgoing connections on port 443 (https) and endpoint role=backend can only accept connections on port 443 (https).
  • Fine grained access control on application protocol level to secure HTTP and remote procedure call (RPC) protocols, e.g. the endpoint with label role=frontend can only perform the REST API call GET /userdata/[0-9]+, all other API interactions with role=backend are restricted.

Currently on the roadmap, to be added soon:

  • Authentication: Any endpoint which wants to initiate a connection to an endpoint with the label role=backend must have a particular security certificate to authenticate itself before being able to initiate any connections. See GH issue 502 for additional details.
  • Encryption: Communication between any endpoint with the label role=frontend to any endpoint with the label role=backend is automatically encrypted with a key that is automatically rotated. See GH issue 504 to track progress on this feature.

Identity based Connectivity Access Control

Container management systems such as Kubernetes deploy a networking model which assigns an individual IP address to each pod (group of containers). This ensures simplicity in architecture, avoids unnecessary network address translation (NAT) and provides each individual container with a full range of port numbers to use. The logical consequence of this model is that depending on the size of the cluster and total number of pods, the networking layer has to manage a large number of IP addresses.

Traditionally, security enforcement architectures have been based on IP address filters. Let’s walk through a simple example: If all pods with the label role=frontend should be allowed to initiate connections to all pods with the label role=backend then each cluster node which runs at least one pod with the label role=backend must have a corresponding filter installed which allows all IP addresses of all role=frontend pods to initiate a connection to the IP addresses of all local role=backend pods. All other connection requests should be denied. Such a filter could look like this: If the destination address is 10.1.1.2 then allow the connection only if the source address is one of the following [10.1.2.2,10.1.2.3,20.4.9.1].
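Expressed as a traditional IP-based filter, the example above boils down to per-address rules of roughly the following shape on the node hosting the role=backend pod (iptables is used purely for illustration; the addresses are the ones from the example):

# allow the current set of role=frontend pod IPs to reach the local role=backend pod
iptables -A FORWARD -s 10.1.2.2 -d 10.1.1.2 -j ACCEPT
iptables -A FORWARD -s 10.1.2.3 -d 10.1.1.2 -j ACCEPT
iptables -A FORWARD -s 20.4.9.1 -d 10.1.1.2 -j ACCEPT
# deny everything else destined to the backend pod
iptables -A FORWARD -d 10.1.1.2 -j DROP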

Every time a new pod with the label role=frontend or role=backend is either started or stopped, the rules on every cluster node which runs any such pods must be updated by either adding or removing the corresponding IP address from the list of allowed IP addresses. In large distributed applications, this could imply updating thousands of cluster nodes multiple times per second depending on the churn rate of deployed pods. Worse, the starting of new role=frontend pods must be delayed until all servers running role=backend pods have been updated with the new security rules as otherwise connection attempts from the new pod could be mistakenly dropped. This makes it difficult to scale efficiently.

In order to avoid these complications which can limit scalability and flexibility, Cilium entirely separates security from network addressing. Instead, security is based on the identity of a pod, which is derived through labels. This identity can be shared between pods. This means that when the first role=frontend pod is started, Cilium assigns an identity to that pod which is then allowed to initiate connections to the identity of the role=backend pod. The subsequent start of additional role=frontend pods only requires resolving this identity via the key-value store; no action has to be performed on any of the cluster nodes hosting role=backend pods. The starting of a new pod must only be delayed until the identity of the pod has been resolved, which is a much simpler operation than updating the security rules on all other cluster nodes.

[Figure: Identity based connectivity between frontend and backend pods (identity.png)]

What is an Endpoint Identity?

The identity of an endpoint is derived based on the labels associated with the pod or container. When a pod or container is started, Cilium will create an endpoint based on the event received by the container runtime to represent the pod or container on the network. As a next step, Cilium will resolve the identity of the endpoint created. Whenever the labels of the pod or container change, the identity is reconfirmed and automatically modified as required.

Not all labels associated with a container or pod are meaningful when deriving the security identity. Labels may be used to store metadata such as the timestamp when a container was launched. Cilium needs to know which labels are meaningful and should be considered when deriving the identity. For this purpose, the user is required to specify a list of string prefixes of meaningful labels. The standard behavior is to include all labels which start with the prefix id., e.g. id.service1, id.service2, id.groupA.service44. The list of meaningful label prefixes can be specified when starting the cilium agent, see Command Line Options.

Special Identities

All endpoints which are managed by Cilium will be assigned an identity. In order to allow communication to network endpoints which are not managed by Cilium, special identities exist to represent those. Special reserved identities are prefixed with the string reserved:.

Identity         Description
reserved:host    The host network namespace on which the pod or container is running.
reserved:world   Any network endpoint outside of the cluster.

TODO: Document cidr: identity once implemented.

Identity Management in the Cluster

Identities are valid in the entire cluster which means that if several pods or containers are started on several cluster nodes, all of them will resolve and share a single identity if they share the identity relevant labels. This requires coordination between cluster nodes.

[Figure: Identity resolution via the distributed key-value store (identity_store.png)]

The operation to resolve an endpoint identity is performed with the help of the distributed key-value store, which allows performing atomic operations of the form “generate a new unique identifier if the following value has not been seen before”. This allows each cluster node to create the identity relevant subset of labels and then query the key-value store to derive the identity. Depending on whether the set of labels has been queried before, either a new identity will be created, or the identity of the initial query will be returned.

Policy Enforcement

All security policies are described assuming stateful policy enforcement for session based protocols. This means that the intent of the policy is to describe allowed direction of connection establishment. If the policy allows A => B then reply packets from B to A are automatically allowed as well. However, B is not automatically allowed to initiate connections to A. If that outcome is desired, then both directions must be explicitly allowed.

Security policies are primarily enforced at ingress which means that each cluster node verifies all incoming packets and determines whether the packet is allowed to be transmitted to the intended endpoint. Policy enforcement also occurs at egress if required by the specific policy, e.g. a Layer 7 policy restricting outgoing API calls.

Layer 3 policies are currently not enforced at egress to avoid the complexity of resolving the destination endpoint identity before sending out the packet. Instead, the identity of the source endpoint is embedded into the packet.

In order to enforce identity based security in a multi host cluster, the identity of the transmitting endpoint is embedded into every network packet that is transmitted between cluster nodes. The receiving cluster node can then extract the identity and verify whether a particular identity is allowed to communicate with any of the local endpoints.

Default Security Policy

If no policy is loaded, the default behaviour is to allow all communication unless policy enforcement has been explicitly enabled. As soon as the first policy rule is loaded, policy enforcement is enabled automatically and any communication must then be whitelisted or the relevant packets will be dropped.

Similarly, if an endpoint is not subject to an L4 policy, communication from and to all ports is permitted. Associating at least one L4 policy with an endpoint will block all connectivity to ports unless explicitly allowed.

Orchestration System Specifics

Kubernetes

Cilium regards each deployed Pod as an endpoint with regard to networking and security policy enforcement. Labels associated with pods can be used to define the identity of the endpoint.

When two pods communicate via a service construct, the labels of the originating pod are used to determine the identity.

Policy Language

The security policy can be specified in the following formats:

  • The Kubernetes NetworkPolicy specification, which allows configuring a subset of the full Cilium security model. See Kubernetes Network Policies for details on how to configure Kubernetes network policies. It is possible to define base rules using the Kubernetes specification and then extend these using additional Cilium specific rules.
  • The Cilium policy language as described below. In addition to what the Kubernetes NetworkPolicy spec supports, the Cilium language allows implementing Layer 7 filtering, deny rules, and hierarchical rules for delegation and precedence purposes. Cilium also provides egress enforcement for Layer 4 and Layer 7 rules.

The data format used by the Cilium policy language is JSON. Additional formats may be supported in the future.

Policy consists of a list of rules:

{
        "rules": [{ rule1, rule2, rule3 }]
}

Policy Rules

Multiple types of policy rules are supported; all types follow the same simple template:

  • coverage: A list of labels which the endpoint must carry.
  • rule: A type specific rule, the following rule types have been implemented:
    • Allow/Requires: Connectivity policy, e.g. allow a pod to talk to another pod
    • L4: Layer 4 connectivity policy

Example:

The following example describes a rule which applies to all endpoints which carry the label role=backend.

[{
        "coverage": ["role=backend"],
        "allow": allowData
}]

Allow Rules

This is the simplest rule type. The rule defines a list of labels which are allowed to consume whatever endpoints are covered by the coverage.

If an endpoint transmits to another endpoint and the communication is not permitted by at least one allow rule, all packets of the connection will be dropped.

Note

Packet drops can be introspected by running the cilium monitor tool which logs each dropped packet including metadata such as the reason (policy denied) and the source and destination identity.

Field      Type              Description
coverage   Array of labels   List of labels that must match in order for this rule to be applied.
allow      Array of allows   List of labels which are allowed to initiate a connection to any endpoint covered by coverage.

allow:

Field    Type     Description
action   string   { “accept”, “always-accept”, “deny” }
label    label    Allowed or denied label

A short form is available as alternative to the above verbose JSON syntax:

Field      Type               Description
coverage   Array of strings   List of labels that must match in order for this rule to be applied.
allow      Array of strings   List of labels which are allowed to initiate a connection to any endpoint covered by coverage. The action is “accept” unless the label has the prefix ! in which case the action is “deny”.

Example:

The following simple example using the short form allows pods with the label role=frontend to consume pods with the label role=backend:

[{
        "coverage": ["role=backend"],
        "allow": ["role=frontend"]
}]

The following example using the short form allows all pods with the label role=frontend to consume pods with the label role=backend unless the frontend pod carries the label user=joe:

[{
        "coverage": ["role=backend"],
        "allow": ["role=frontend", "!user=joe"]
}]

The special always-accept action is useful in combination with hierarchical policy trees. It allows defining allow rules which cannot be overruled by child policy nodes. See Policy Repository for additional information on policy trees and their precedence model.

The following example shows a child node role, which contains a rule that disallows access from role=frontend to role=backend. However, the parent node root allows access by using always-accept.

{
        "name": "root",
        "rules": [{
                "coverage": ["role=backend"],
                "allow": [{
                        "action": "always-accept",
                        "label": { "key": "role=frontend" }
                }]
        }],
        "children": {
                "role": {
                        "rules": [{
                                "coverage": ["role=backend"],
                                "allow": ["!role=frontend"]
                        }]
                }
        }
}

Requires Rules

Requires rules define a list of additional labels that must be present in the sending endpoint for an allow rule to take effect. A requires rule itself does not grant permissions for consumption; it merely imposes additional constraints. At least one allow rule is always required.

Field      Type              Description
coverage   Array of labels   List of labels that must match in order for this rule to be applied.
requires   Array of labels   List of labels that must be present in any transmitting endpoint desiring to connect to any endpoint covered by coverage.

If an endpoint transmits to another endpoint and the communication is not permitted because at least one of the required labels is not present, then the behaviour is the same as if no allow rule applied.

[{
        "coverage": ["role=backend"],
        "allow": ["role=frontend"]
},
{
        "coverage": ["env=qa"],
        "requires": ["env=qa"]
},
{
        "coverage": ["env=prod"],
        "requires": ["env=prod"]
}]

The example above extends the existing allow rule with two additional requires rules. The first rule says that if an endpoint carries the label env=qa then the consuming endpoint also needs to carry the label env=qa. The second rule does the same for the label env=prod. Requires rules allow for simple segmentation of existing rules into multiple environments or groups.

Layer 4 Rules

The L4 rule allows imposing Layer 4 restrictions on endpoints. It can be applied to either incoming or outgoing connections. An L4 rule by itself does not allow communication; it must be combined with an allow rule to establish basic connectivity.

Field       Type                 Description
coverage    Array of labels      List of labels that must match in order for this rule to be applied.
in-ports    Array of l4-policy   Layer 4 policy for any incoming connection to an endpoint covered by coverage.
out-ports   Array of l4-policy   Layer 4 policy for any outgoing connection from an endpoint covered by coverage.

l4-policy:

Field       Type               Description
port        integer            Allowed destination port
protocol    string             Allowed protocol {“tcp”, “udp”} (optional)
l7-parser   string             Name of the Layer 7 parser. If set, causes traffic to be inspected based on the l7-rules. (optional)
l7-rules    Array of strings   Array of rules passed to the Layer 7 parser (optional). See Layer 7 Rules.

The following example shows how to restrict Layer 4 communication of any endpoint carrying the label role=frontend and restrict incoming connections to TCP on port 80 or port 443. Outgoing connections must also be TCP and are restricted to port 8080.

[{
        "coverage": ["role=frontend"],
        "l4": [{
                "in-ports": [
                        { "port": 80, "protocol": "tcp" },
                        { "port": 443, "protocol": "tcp" }
                ],
                "out-ports": [{
                        "port": 8080, "protocol": "tcp"
                }]
        }]
}]

Layer 7 Rules

Layer 7 rules are currently limited to IPv4. Policies can be applied for both incoming and outgoing requests. The enforcement point is defined by the location of the rules in either the “in-ports” or “out-ports” field of the Layer 4 policy rule.

Unlike Layer 3 and Layer 4 policies, violation of Layer 7 rules does not result in packet drops. Instead, if possible, an access denied message such as an HTTP 403 access denied is sent back to the sending endpoint.

TODO: describe rules
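Pending that description, the hedged sketch below shows how such rules could be embedded into a Layer 4 policy using the l7-parser and l7-rules fields listed above; the parser name http and the rule expression (borrowed from the GET /userdata/[0-9]+ example in the Security overview) are assumptions, not a definitive syntax:

[{
        "coverage": ["role=backend"],
        "l4": [{
                "in-ports": [{
                        "port": 80,
                        "protocol": "tcp",
                        "l7-parser": "http",
                        "l7-rules": ["GET /userdata/[0-9]+"]
                }]
        }]
}]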

Policy Repository

Policy rules imported into the Cilium agent are not shared with other compute nodes and are only enforced within the boundaries of the compute node. In order to enforce security policies across an entire cluster, one of the following options can be applied to distribute security policies across all cluster nodes:

  • Use of Kubernetes NetworkPolicy objects to define the policy. NetworkPolicy objects are automatically distributed to all worker nodes and the Cilium agent will import them automatically. (TODO: Describe option to use third-party objects to distribute native Cilium policy.)
  • Use of a configuration management system such as chef, puppet, ansible, or cfengine to automatically import a policy into all agents. (TODO: link to guide as soon as one exists.)
  • Use of a git tree to maintain the policy in combination with a post-merge hook which automatically imports the policy (see the sketch after this list). (TODO: Write & link to guide)
  • Use of a distributed filesystem shared across all cluster nodes in combination with a filesystem watcher that invokes cilium import upon detection of any change.
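For the git-based option, a minimal post-merge hook could look like the following sketch; the policy path is a placeholder and the exact import subcommand may differ between Cilium versions:

#!/bin/sh
# .git/hooks/post-merge: re-import the policy tree after every merge
cilium policy import /etc/cilium/policy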

Integration with Container Platforms

Cilium is deeply integrated with container platforms like Docker or Kubernetes. This enables Cilium to perform network forwarding and security using a model that maps directly to notions of identity (e.g., labels) and service abstractions that are native to the container platform.

In this section, we will provide more detail on how Cilium integrates with Docker and Kubernetes.

Docker supports network plugins via the libnetwork plugin interface.

When using Cilium with Docker, one creates a single logical Docker network of type cilium with an IPAM driver of type cilium. This delegates control over IP address management and network connectivity to Cilium for all containers attached to this network, for both IPv4 and IPv6 connectivity. Each Docker container gets an IP address from the node prefix of the node running the container.
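For example, such a network can be created with the standard docker network create flags; the network name cilium matches the --net cilium used in the earlier endpoint example, while any subnet options are left to the operator (consult the Administrator Guide for the exact invocation recommended for your version):

$ docker network create --driver cilium --ipam-driver cilium cilium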

When deployed with Docker, each Linux node runs a cilium-docker agent, which receives libnetwork calls from Docker and then communicates with the Cilium Agent to control container networking.

Security policies controlling connectivity between the Docker containers can be written in terms of the Docker container labels passed to Docker while creating the container. These policies can be created/updated via communication directly with the Cilium agent, either via API or by using the Cilium CLI client.

When deployed with Kubernetes, Cilium provides four core Kubernetes networking capabilities:

  • Direct pod-to-pod network inter-connectivity.
  • Service-based load-balancing for pod-to-pod inter-connectivity (i.e., a kube-proxy replacement).
  • Identity-based security policies for all (direct and service-based) Pod-to-Pod inter-connectivity.
  • External-to-Pod service-based load-balancing (referred to as Ingress in Kubernetes)

The Kubernetes documentation contains more background on the Kubernetes Networking Model and Kubernetes Network Plugins.

In Kubernetes, containers are deployed within units referred to as Pods, which include one or more containers reachable via a single IP address. With Cilium, each Pod gets an IP address from the node prefix of the Linux node running the Pod. In the absence of any network security policies, all Pods can reach each other.

Pod IP addresses are typically local to the Kubernetes cluster. If pods need to reach services outside the cluster as a client, the Kubernetes nodes are typically configured to IP masquerade all traffic sent from containers to external prefixes.

Kubernetes has developed the Services abstraction which provides the user the ability to load balance network traffic to different pods. This abstraction allows pods to reach other pods via a single virtual IP address, without knowing all of the pods that are running that particular service.

Without Cilium, kube-proxy is installed on every node and watches for endpoint and service additions and removals on the kube-master, which allows it to apply the necessary enforcement via iptables. Thus, traffic received from and sent to the pods is properly routed to the node and port serving that service. For more information, you can check out the Kubernetes user guide for Services.

The Cilium load-balancer acts on the same principles as kube-proxy: it watches for service additions or removals, but instead of doing the enforcement via iptables, it updates BPF map entries on each node. For more information, see the Pull Request.

TODO: describe benefits of BPF based load-balancer compared to kube-proxy iptables

TODO: Verify this

Kubernetes supports an abstraction known as Ingress that allows a Pod-based Kubernetes service to expose itself for access outside of the cluster in a load-balanced way. In a typical setup, the external traffic would be sent to a publicly reachable IP + port on the host running the Kubernetes master, and then be load-balanced to the pods implementing the current service within the cluster.

Cilium supports Ingress with TCP-based load-balancing. Moreover, it supports “direct server return”, meaning that reply traffic from the pod to the external client is sent directly, without needing to pass through the Kubernetes master host.

TODO: insert graphic showing LB + DSR.