Program Types
At the time of this writing, there are eighteen different BPF program types available. Two of the main types for networking, XDP BPF programs and tc BPF programs, are further explained in the subsections below. Extensive usage examples for the two program types with LLVM, iproute2 and other tools are spread throughout the toolchain section and are not covered here. Instead, this section focuses on their architecture, concepts and use cases.
XDP
XDP stands for eXpress Data Path and provides a framework for BPF that enables high-performance programmable packet processing in the Linux kernel. It runs the BPF program at the earliest possible point in software, namely at the moment the network driver receives the packet.
At this point in the fast-path the driver has just picked up the packet from its receive rings, without having done any expensive operations such as allocating an skb for pushing the packet further up the networking stack, and without having pushed the packet into the GRO engine, etc. Thus, the XDP BPF program is executed at the earliest point when the packet becomes available to the CPU for processing.
XDP works in concert with the Linux kernel and its infrastructure, meaning the kernel is not bypassed as in various networking frameworks that operate in user space only. Keeping the packet in kernel space has several major advantages:
XDP is able to reuse all the upstream developed kernel networking drivers, user space tooling, and other available in-kernel infrastructure such as routing tables, sockets, etc. from BPF helper calls.
Residing in kernel space, XDP has the same security model as the rest of the kernel for accessing hardware.
There is no need for crossing kernel / user space boundaries since the processed packet already resides in the kernel and can therefore be flexibly forwarded into other in-kernel entities like namespaces used by containers or the kernel’s networking stack itself. This is particularly relevant in times of Meltdown and Spectre.
Punting packets from XDP to the kernel’s robust, widely used and efficient TCP/IP stack is trivially possible, allows for full reuse and does not require maintaining a separate TCP/IP stack as with user space frameworks.
The use of BPF allows for full programmability, keeping a stable ABI with the same ‘never-break-user-space’ guarantees as with the kernel’s system call ABI and compared to modules it also provides safety measures thanks to the BPF verifier that ensures the stability of the kernel’s operation.
XDP trivially allows for atomically swapping programs during runtime without any network traffic interruption or even kernel / system reboot.
XDP allows for flexible structuring of workloads integrated into the kernel. For example, it can operate in “busy polling” or “interrupt driven” mode. Explicitly dedicating CPUs to XDP is not required. There are no special hardware requirements and it does not rely on hugepages.
XDP does not require any third party kernel modules or licensing. It is a long-term architectural solution, a core part of the Linux kernel, and developed by the kernel community.
XDP is already enabled and shipped everywhere with major distributions running a kernel equivalent to 4.8 or higher and supports most major 10G or higher networking drivers.
As a framework for running BPF in the driver, XDP additionally ensures that packets are laid out linearly and fit into a single DMA’ed page which is readable and writable by the BPF program. XDP also ensures that additional headroom of 256 bytes is available to the program for implementing custom encapsulation headers with the help of the bpf_xdp_adjust_head() BPF helper, or for adding custom metadata in front of the packet through bpf_xdp_adjust_meta().
The framework contains XDP action codes, further described in the section below, which a BPF program can return in order to instruct the driver how to proceed with the packet, and it enables the possibility to atomically replace BPF programs running at the XDP layer. XDP is tailored for high performance by design. BPF allows access to the packet data through ‘direct packet access’, which means that the program holds data pointers directly in registers, loads the content into registers and writes from there into the packet.
The packet representation in XDP that is passed to the BPF program as the BPF context looks as follows:
struct xdp_buff {
    void *data;
    void *data_end;
    void *data_meta;
    void *data_hard_start;
    struct xdp_rxq_info *rxq;
};
data points to the start of the packet data in the page, and as the name suggests, data_end points to the end of the packet data. Since XDP allows for a headroom, data_hard_start points to the maximum possible headroom start in the page, meaning, when the packet should be encapsulated, then data is moved closer towards data_hard_start via bpf_xdp_adjust_head(). The same BPF helper function also allows for decapsulation, in which case data is moved further away from data_hard_start.
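To illustrate how this headroom is consumed, the following minimal sketch reserves room for a made-up encapsulation header via bpf_xdp_adjust_head(); the struct enc_hdr type and its fields are purely illustrative and not a kernel structure. Note that BPF programs see this buffer through the UAPI struct xdp_md context rather than struct xdp_buff directly:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical 8 byte custom encapsulation header, not a kernel type. */
struct enc_hdr {
    __u32 id;
    __u32 flags;
};

SEC("xdp")
int xdp_push_enc_hdr(struct xdp_md *ctx)
{
    /* A negative delta moves data towards data_hard_start, i.e. it
     * consumes headroom so a new header can be written in front. */
    if (bpf_xdp_adjust_head(ctx, -(int)sizeof(struct enc_hdr)))
        return XDP_ABORTED;

    /* All packet pointers are invalidated by the helper and must be
     * re-derived and bounds checked again before any access. */
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct enc_hdr *hdr = data;

    if (data + sizeof(*hdr) > data_end)
        return XDP_ABORTED;

    hdr->id    = 1;
    hdr->flags = 0;
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";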
data_meta initially points to the same location as data, but bpf_xdp_adjust_meta() is able to move the pointer towards data_hard_start as well in order to provide room for custom metadata which is invisible to the normal kernel networking stack but can be read by tc BPF programs since it is transferred from XDP to the skb. Vice versa, it can remove or reduce the size of the custom metadata through the same BPF helper function by moving data_meta away from data_hard_start again. data_meta can also be used solely for passing state between tail calls, similarly to the skb->cb[] control block case that is accessible in tc BPF programs.
This gives the following relation, or invariant, for the struct xdp_buff packet pointers: data_hard_start <= data_meta <= data < data_end.
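Putting the pointer invariant to use, a minimal direct packet access sketch looks as follows; the verifier only permits reads or writes after an explicit bounds check against data_end:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_parse_eth(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* Direct packet access: the verifier only allows dereferencing
     * eth after proving it lies within [data, data_end). */
    if ((void *)(eth + 1) > data_end)
        return XDP_ABORTED;

    /* eth->h_proto, eth->h_dest, ... can now be read or written. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";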
The rxq field points to some additional per receive queue metadata which is populated at ring setup time (not at XDP runtime):
struct xdp_rxq_info {
    struct net_device *dev;
    u32 queue_index;
    u32 reg_state;
} ____cacheline_aligned;
The BPF program can retrieve queue_index as well as additional data from the netdevice itself such as ifindex, etc.
BPF program return codes
After running the XDP BPF program, a verdict is returned from the program in order to tell the driver how to process the packet next. In the linux/bpf.h system header file all available return verdicts are enumerated:
enum xdp_action {
    XDP_ABORTED = 0,
    XDP_DROP,
    XDP_PASS,
    XDP_TX,
    XDP_REDIRECT,
};
XDP_DROP as the name suggests will drop the packet right at the driver level without wasting any further resources. This is in particular useful for BPF programs implementing DDoS mitigation mechanisms or firewalling in general. The XDP_PASS return code means that the packet is allowed to be passed up to the kernel’s networking stack. Meaning, the current CPU that was processing this packet now allocates an skb, populates it, and passes it onwards into the GRO engine. This would be equivalent to the default packet handling behavior without XDP. With XDP_TX the BPF program has an efficient option to transmit the network packet out of the same NIC it just arrived on again. This is typically useful when a few nodes are implementing, for example, firewalling with subsequent load balancing in a cluster and thus act as a hairpinned load balancer pushing the incoming packets back into the switch after rewriting them in XDP BPF. XDP_REDIRECT is similar to XDP_TX in that it is able to transmit the XDP packet, but through another NIC. Another option for the XDP_REDIRECT case is to redirect into a BPF cpumap, meaning, the CPUs serving XDP on the NIC’s receive queues can continue to do so and push the packet for processing by the upper kernel stack to a remote CPU. This is similar to XDP_PASS, but with the difference that the XDP BPF program can keep serving the incoming high load as opposed to temporarily spending work on the current packet for pushing it into upper layers. Last but not least, XDP_ABORTED serves to denote an exception-like state from the program and has the same behavior as XDP_DROP, except that XDP_ABORTED passes the trace_xdp_exception tracepoint, which can be additionally monitored to detect misbehavior.
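A small sketch tying the verdicts together; the blocked UDP port 9999 is an arbitrary example value, and for brevity the sketch assumes an IPv4 header without options. Malformed packets return XDP_ABORTED, matching UDP packets are dropped with XDP_DROP, everything else is handed to the stack with XDP_PASS:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_verdicts(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_ABORTED;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_ABORTED;
    if (iph->protocol != IPPROTO_UDP)
        return XDP_PASS;

    /* Sketch only: assumes no IPv4 options are present. */
    struct udphdr *udph = (void *)(iph + 1);
    if ((void *)(udph + 1) > data_end)
        return XDP_ABORTED;

    /* Example policy: silently drop UDP traffic to port 9999. */
    if (udph->dest == bpf_htons(9999))
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";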
Use cases for XDP
Some of the main use cases for XDP are presented in this subsection. The list is non-exhaustive and, given the programmability and efficiency that XDP and BPF enable, it can easily be adapted to solve very specific use cases.
DDoS mitigation, firewalling
One of the basic XDP BPF features is to tell the driver to drop a packet with XDP_DROP at this early stage, which allows for any kind of efficient network policy enforcement with an extremely low per-packet cost. This is ideal in situations when needing to cope with any sort of DDoS attacks, but more generally it also allows implementing any sort of firewalling policies with close to no overhead in BPF, e.g. in either case as a standalone appliance (for example, scrubbing ‘clean’ traffic through XDP_TX) or widely deployed on nodes protecting end hosts themselves (via XDP_PASS or cpumap XDP_REDIRECT for good traffic). Offloaded XDP takes this even one step further by moving the already small per-packet cost entirely into the NIC with processing at line-rate.
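As a hedged sketch of the drop-at-the-driver pattern, the program below consults a hypothetical blocklist hash map keyed by IPv4 source address (populated from user space, e.g. by a detection pipeline) and returns XDP_DROP on a hit:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical map of blocked IPv4 source addresses, filled from user space. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 100000);
    __type(key, __u32);   /* IPv4 saddr, network byte order */
    __type(value, __u8);  /* value unused, presence means "blocked" */
} blocklist SEC(".maps");

SEC("xdp")
int xdp_ddos_drop(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_ABORTED;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_ABORTED;

    /* Copy the key to the stack before the map lookup. */
    __u32 saddr = iph->saddr;
    if (bpf_map_lookup_elem(&blocklist, &saddr))
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";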
Forwarding and load-balancing
Another major use case of XDP is packet forwarding and load-balancing through either XDP_TX or XDP_REDIRECT actions. The packet can be arbitrarily mangled by the BPF program running in the XDP layer; BPF helper functions are even available for increasing or decreasing the packet’s headroom in order to arbitrarily encapsulate or decapsulate the packet before sending it out again. With XDP_TX, hairpinned load-balancers can be implemented that push the packet out of the same networking device it originally arrived on, or with the XDP_REDIRECT action it can be forwarded to another NIC for transmission. The latter return code can also be used in combination with BPF’s cpumap to load-balance packets for passing up the local stack, but on remote, non-XDP processing CPUs.
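A minimal sketch of the XDP_REDIRECT path using a BPF devmap follows; the single map slot and the target ifindex it holds are assumptions that a (hypothetical) control plane would fill in from user space:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Devmap holding target ifindexes, populated from user space. */
struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} tx_port SEC(".maps");

SEC("xdp")
int xdp_redirect_out(struct xdp_md *ctx)
{
    /* Forward every packet through the device stored in slot 0.
     * bpf_redirect_map() returns XDP_REDIRECT on success and falls
     * back to the action passed in the flags (here XDP_PASS) if the
     * slot is empty. */
    return bpf_redirect_map(&tx_port, 0, XDP_PASS);
}

char _license[] SEC("license") = "GPL";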
Pre-stack filtering / processing
Besides policy enforcement, XDP can also be used for hardening the kernel’s networking stack with the help of the XDP_DROP case, meaning, it can drop packets irrelevant to a local node right at the earliest possible point before the networking stack sees them, e.g. given we know that a node only serves TCP traffic, any UDP, SCTP or other L4 traffic can be dropped right away. This has the advantage that packets do not need to traverse various entities like the GRO engine, the kernel’s flow dissector and others before it can be determined to drop them, which allows for reducing the kernel’s attack surface. Thanks to XDP’s early processing stage, this effectively ‘pretends’ to the kernel’s networking stack that these packets have never been seen by the networking device. Additionally, if a potential bug in the stack’s receive path got uncovered and would cause a ‘ping of death’ like scenario, XDP can be utilized to drop such packets right away without having to reboot the kernel or restart any services. Due to the ability to atomically swap such programs to enforce a drop of bad packets, no network traffic is even interrupted on a host.

Another use case for pre-stack processing is that given the kernel has not yet allocated an skb for the packet, the BPF program is free to modify the packet and, again, have it ‘pretend’ to the stack that it was received by the networking device this way. This allows for cases such as having custom packet mangling and encapsulation protocols where the packet can be decapsulated prior to entering GRO aggregation, where GRO otherwise would not be able to perform any sort of aggregation due to not being aware of the custom protocol. XDP also allows to push metadata (non-packet data) in front of the packet. This is ‘invisible’ to the normal kernel stack, can be GRO aggregated (for matching metadata) and later on processed in coordination with a tc ingress BPF program where it has the context of an skb available for e.g. setting various skb fields.
Flow sampling, monitoring
XDP can also be used for cases such as packet monitoring, sampling or any other network analytics, for example, as part of an intermediate node in the path or on end hosts, also in combination with the previously mentioned use cases. For complex packet analysis, XDP provides a facility to efficiently push network packets (truncated or with full payload) and custom metadata into a fast lockless per-CPU memory mapped ring buffer provided by the Linux perf infrastructure to a user space application. This also allows for cases where only a flow’s initial data is analyzed and, once determined to be good traffic, the monitoring is bypassed. Thanks to the flexibility brought by BPF, this allows for implementing any sort of custom monitoring or sampling.
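A sketch of the sampling idea: push a small, self-defined sample record per packet into a perf event ring buffer via bpf_perf_event_output() for a user space reader (not shown) to consume. The struct pkt_sample layout is illustrative, not a fixed format:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} events SEC(".maps");

/* Illustrative sample record pushed to user space per packet. */
struct pkt_sample {
    __u32 pkt_len;
    __u32 ifindex;
    __u32 rx_queue;
};

SEC("xdp")
int xdp_sample(struct xdp_md *ctx)
{
    struct pkt_sample s = {
        .pkt_len  = ctx->data_end - ctx->data,
        .ifindex  = ctx->ingress_ifindex,
        .rx_queue = ctx->rx_queue_index,
    };

    /* Emit the record on the current CPU's ring; truncated packet
     * payload could additionally be appended via the upper 32 bits
     * of the flags argument. */
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &s, sizeof(s));
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";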
One example of XDP BPF production usage is Facebook’s SHIV and Droplet infrastructure which implement their L4 load-balancing and DDoS countermeasures. Migrating their production infrastructure away from netfilter’s IPVS (IP Virtual Server) over to XDP BPF allowed for a 10x speedup compared to their previous IPVS setup. This was first presented at the netdev 2.1 conference:
Slides: https://netdevconf.info/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
Video: https://youtu.be/YEU2ClcGqts
Another example is the integration of XDP into Cloudflare’s DDoS mitigation pipeline, which originally used cBPF instead of eBPF for attack signature matching through iptables’ xt_bpf module. The use of iptables caused severe performance problems under attack, so a user space bypass solution was deemed necessary, but it came with drawbacks as well, such as needing to busy poll the NIC and expensive packet re-injection into the kernel’s stack. The migration over to eBPF and XDP combined the best of both worlds by having high-performance programmable packet processing directly inside the kernel:
Slides: https://netdevconf.info/2.1/slides/apr6/bertin_Netdev-XDP.pdf
Video: https://youtu.be/7OuOukmuivg
XDP operation modes
XDP has three operation modes where ‘native’ XDP is the default mode. When XDP is discussed, this mode is typically implied.
Native XDP
This is the default mode where the XDP BPF program is run directly out of the networking driver’s early receive path. Most widely used NICs for 10G and higher already support native XDP.
Offloaded XDP
In the offloaded XDP mode the XDP BPF program is directly offloaded into the NIC instead of being executed on the host CPU. Thus, the already extremely low per-packet cost is pushed off the host CPU entirely and executed on the NIC, providing even higher performance than running in native XDP. This offload is typically implemented by SmartNICs containing multi-threaded, multicore flow processors where an in-kernel JIT compiler translates BPF into native instructions for the latter. Drivers supporting offloaded XDP usually also support native XDP for cases where some BPF helpers may not yet be available, or may only be available, for the native mode.
Generic XDP
For drivers not yet implementing native or offloaded XDP, the kernel provides an option for generic XDP which does not require any driver changes since it runs at a much later point out of the networking stack. This setting is primarily targeted at developers who want to write and test programs against the kernel’s XDP API, and it will not operate at the performance rate of the native or offloaded modes. For XDP usage in a production environment either the native or offloaded mode is better suited and the recommended way to run XDP.
Driver support
Drivers supporting native XDP
A list of drivers supporting native XDP can be found in the table below. The corresponding network driver name of an interface can be determined as follows:
# ethtool -i eth0
driver: nfp
[...]
Vendor | Driver | XDP Support
---|---|---
Amazon | ena | >= 5.6
Aquantia | atlantic | >= 5.19
Broadcom | bnxt_en | >= 4.11
Cavium | thunderx | >= 4.12
Engleder | tsnep (TSN Express Path) | >= 6.3
Freescale | dpaa | >= 5.11
 | dpaa2 | >= 5.0
 | enetc | >= 5.13
 | fec_enet | >= 6.2
Fungible | fun | >= 5.18
Google | gve | >= 6.4
Intel | ice | >= 5.5
 | igb | >= 5.10
 | igc | >= 5.13
 | i40e | >= 4.13
 | ixgbe | >= 4.12
 | ixgbevf | >= 4.17
Marvell | mvneta | >= 5.5
 | mvpp2 | >= 5.9
 | otx2 | >= 5.16
Mediatek | mtk | >= 6.0
Mellanox | mlx4 | >= 4.8
 | mlx5 | >= 4.9
Microchip | lan966x | >= 6.2
Microsoft | hv_netvsc (Hyper-V) | >= 5.6
 | mana | >= 5.17
Netronome | nfp | >= 4.10
Others | bonding | >= 5.15
 | netdevsim | >= 4.16
 | tun/tap | >= 4.14
 | virtio_net | >= 4.10
 | xen-netfront | >= 5.9
 | veth | >= 4.19
QLogic | qede | >= 4.10
Socionext | netsec | >= 5.3
Solarflare | SFC Efx | >= 5.5
STMicro | stmmac | >= 5.13
Texas Instruments | cpsw | >= 5.3
VMware | vmxnet3 | >= 6.6
Drivers supporting offloaded XDP
Netronome
nfp [2]
Note
Examples for writing and loading XDP programs are included in the Development Tools section under the respective tools.
tc (traffic control)
Aside from other program types such as XDP, BPF can also be used out of the kernel’s tc (traffic control) layer in the networking data path. On a high-level there are three major differences when comparing XDP BPF programs to tc BPF ones:
The BPF input context is a sk_buff, not a xdp_buff. When the kernel’s networking stack receives a packet, after the XDP layer, it allocates a buffer and parses the packet to store metadata about the packet. This representation is known as the sk_buff. This structure is then exposed in the BPF input context so that BPF programs from the tc ingress layer can use the metadata that the stack extracts from the packet. This can be useful, but comes with an associated cost of the stack performing this allocation and metadata extraction, and handling the packet until it hits the tc hook. By definition, the xdp_buff doesn’t have access to this metadata because the XDP hook is called before this work is done. This is a significant contributor to the performance difference between the XDP and tc hooks.

Therefore, BPF programs attached to the tc BPF hook can, for instance, read or write the skb’s mark, pkt_type, protocol, priority, queue_mapping, napi_id, cb[] array, hash, tc_classid or tc_index, vlan metadata, the XDP transferred custom metadata and various other information (see the example after this list of differences). All members of the struct __sk_buff BPF context used in tc BPF are defined in the linux/bpf.h system header.

Generally, the sk_buff is of a completely different nature than xdp_buff where both come with advantages and disadvantages. For example, the sk_buff case has the advantage that it is rather straightforward to mangle its associated metadata, however, it also contains a lot of protocol specific information (e.g. GSO related state) which makes it difficult to simply switch protocols by solely rewriting the packet data. This is due to the stack processing the packet based on the metadata rather than having the cost of accessing the packet contents each time. Thus, additional conversion is required from BPF helper functions taking care that sk_buff internals are properly converted as well. The xdp_buff case however does not face such issues since it comes at such an early stage where the kernel has not even allocated an sk_buff yet, thus packet rewrites of any kind can be realized trivially. However, the xdp_buff case has the disadvantage that sk_buff metadata is not available for mangling at this stage. The latter is overcome by passing custom metadata from XDP BPF to tc BPF, though. In this way, the limitations of each program type can be overcome by operating complementary programs of both types as the use case requires.
Compared to XDP, tc BPF programs can be triggered out of ingress and also egress points in the networking data path as opposed to ingress only in the case of XDP.
The two hook points sch_handle_ingress() and sch_handle_egress() in the kernel are triggered out of __netif_receive_skb_core() and __dev_queue_xmit(), respectively. The latter two are the main receive and transmit functions in the data path that, setting XDP aside, are triggered for every network packet going in or coming out of the node, allowing for full visibility for tc BPF programs at these hook points.
The tc BPF programs do not require any driver changes since they are run at hook points in generic layers in the networking stack. Therefore, they can be attached to any type of networking device.
While this provides flexibility, it also trades off performance compared to running at the native XDP layer. However, tc BPF programs still come at the earliest point in the generic kernel’s networking data path after GRO has been run but before any protocol processing, traditional iptables firewalling such as iptables PREROUTING or nftables ingress hooks or other packet processing takes place. Likewise on egress, tc BPF programs execute at the latest point before handing the packet to the driver itself for transmission, meaning after traditional iptables firewalling hooks like iptables POSTROUTING, but still before handing the packet to the kernel’s GSO engine.
One exception which does require driver changes, however, are offloaded tc BPF programs, typically provided by SmartNICs in a similar way as offloaded XDP, just with a differing set of features due to the differences in the BPF input context, helper functions and verdict codes.
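As a minimal sketch of the first point above, the following tc BPF program works purely on the __sk_buff context, reading the stack-provided protocol field and tagging IPv4 packets via skb->mark; the 0xcafe value is an arbitrary example mark, not a kernel constant:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int tc_tag_ipv4(struct __sk_buff *skb)
{
    /* skb->protocol is metadata extracted by the stack, no packet
     * parsing needed here; skb->mark is writable from tc BPF. */
    if (skb->protocol == bpf_htons(ETH_P_IP))
        skb->mark = 0xcafe; /* arbitrary example mark */

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";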
BPF programs run in the tc layer are run from the cls_bpf classifier. While the tc terminology describes the BPF attachment point as a “classifier”, this is a bit misleading since it under-represents what cls_bpf is capable of. That is to say, a fully programmable packet processor being able not only to read the skb metadata and packet data, but to also arbitrarily mangle both, and terminate the tc processing with an action verdict. cls_bpf can thus be regarded as a self-contained entity that manages and executes tc BPF programs.
cls_bpf can hold one or more tc BPF programs. In the case where Cilium deploys cls_bpf programs, it attaches only a single program for a given hook in direct-action mode. Typically, in the traditional tc scheme, there is a split between classifier and action modules, where the classifier has one or more actions attached to it that are triggered once the classifier has a match. In the modern world for using tc in the software data path this model does not scale well for complex packet processing. Given tc BPF programs attached to cls_bpf are fully self-contained, they effectively fuse the parsing and action process together into a single unit. Thanks to cls_bpf’s direct-action mode, it will just return the tc action verdict and terminate the processing pipeline immediately. This allows for implementing scalable programmable packet processing in the networking data path by avoiding linear iteration of actions. cls_bpf is the only such “classifier” module in the tc layer capable of such a fast-path.
Like XDP BPF programs, tc BPF programs can be atomically updated at runtime via cls_bpf without interrupting any network traffic or having to restart services.
Both the tc ingress and the egress hook that cls_bpf itself can be attached to are managed by a pseudo qdisc called sch_clsact. This is a drop-in replacement and proper superset of the ingress qdisc since it is able to manage both the ingress and egress tc hooks. For tc’s egress hook in __dev_queue_xmit() it is important to stress that it is not executed under the kernel’s qdisc root lock. Thus, both tc ingress and egress hooks are executed in a lockless manner in the fast-path. In either case, preemption is disabled and execution happens under RCU read side.
Typically on egress there are qdiscs attached to netdevices such as sch_mq, sch_fq, sch_fq_codel or sch_htb where some of them are classful qdiscs that contain subclasses and thus require a packet classification mechanism to determine a verdict where to demux the packet. This is handled by a call to tcf_classify() which calls into tc classifiers if present. cls_bpf can also be attached and used in such cases. Such operation usually happens under the qdisc root lock and can be subject to lock contention. The sch_clsact qdisc’s egress hook comes at a much earlier point, however, which does not fall under that lock and operates completely independently from conventional egress qdiscs. Thus, for cases like sch_htb the sch_clsact qdisc could perform the heavy lifting packet classification through tc BPF outside of the qdisc root lock, setting the skb->mark or skb->priority from there, such that sch_htb only requires a flat mapping without expensive packet classification under the root lock, thus reducing contention.
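The following hedged sketch illustrates that split: the tc BPF program on the sch_clsact egress hook does the classification lock-free and records the decision in skb->priority, so that a classful qdisc such as sch_htb configured with matching class IDs only needs a flat lookup under its root lock. The 1:10 / 1:20 class handles and the mark-based policy are assumptions for illustration:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tc_classify_egress(struct __sk_buff *skb)
{
    /* TC class IDs encode major:minor as (major << 16) | minor.
     * The classes 1:10 and 1:20 are assumed to exist on sch_htb. */
    if (skb->mark == 1)                     /* e.g. latency sensitive flows */
        skb->priority = (1 << 16) | 10;     /* HTB class 1:10 */
    else
        skb->priority = (1 << 16) | 20;     /* HTB class 1:20 */

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";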
Offloaded tc BPF programs are supported for the case of sch_clsact in combination with cls_bpf where the prior loaded BPF program was JITed from a SmartNIC driver to be run natively on the NIC. Only cls_bpf programs operating in direct-action mode are supported to be offloaded. cls_bpf only supports offloading a single program and cannot offload multiple programs. Furthermore only the ingress hook supports offloading BPF programs.
One cls_bpf instance is able to hold multiple tc BPF programs internally. If this is the case, then the TC_ACT_UNSPEC program return code will continue execution with the next tc BPF program in that list. However, this has the drawback that several programs would need to parse the packet over and over again, resulting in degraded performance.
BPF program return codes
Both the tc ingress and egress hook share the same action return verdicts that tc BPF programs can use. They are defined in the linux/pkt_cls.h system header:
#define TC_ACT_UNSPEC (-1)
#define TC_ACT_OK 0
#define TC_ACT_SHOT 2
#define TC_ACT_STOLEN 4
#define TC_ACT_REDIRECT 7
There are a few more action TC_ACT_* verdicts available in the system header file which are also used in the two hooks. However, they share the same semantics with the ones above. Meaning, from a tc BPF perspective, TC_ACT_OK and TC_ACT_RECLASSIFY have the same semantics, as well as the three TC_ACT_STOLEN, TC_ACT_QUEUED and TC_ACT_TRAP opcodes. Therefore, for these cases we only describe TC_ACT_OK and the TC_ACT_STOLEN opcode for the two groups.
Starting out with TC_ACT_UNSPEC: it has the meaning of “unspecified action” and is used in three cases, i) when an offloaded tc BPF program is attached and the tc ingress hook is run where the cls_bpf representation for the offloaded program will return TC_ACT_UNSPEC, ii) in order to continue with the next tc BPF program in cls_bpf for the multi-program case. The latter also works in combination with offloaded tc BPF programs from point i), where the TC_ACT_UNSPEC from there continues with a next tc BPF program solely running in the non-offloaded case. Last but not least, iii) TC_ACT_UNSPEC is also used for the single program case to simply tell the kernel to continue with the skb without additional side-effects. TC_ACT_UNSPEC is very similar to the TC_ACT_OK action code in the sense that both pass the skb onwards, either to upper layers of the stack on ingress or down to the networking device driver for transmission on egress, respectively. The only difference to TC_ACT_OK is that TC_ACT_OK sets skb->tc_index based on the classid the tc BPF program set. The latter is set out of the tc BPF program itself through skb->tc_classid from the BPF context.
TC_ACT_SHOT instructs the kernel to drop the packet, meaning, upper layers of the networking stack will never see the skb on ingress and similarly the packet will never be submitted for transmission on egress. TC_ACT_SHOT and TC_ACT_STOLEN are both similar in nature with few differences: TC_ACT_SHOT will indicate to the kernel that the skb was released through kfree_skb() and return NET_XMIT_DROP to the callers for immediate feedback, whereas TC_ACT_STOLEN will release the skb through consume_skb() and pretend to upper layers that the transmission was successful through NET_XMIT_SUCCESS. The perf’s drop monitor which records traces of kfree_skb() will therefore also not see any drop indications from TC_ACT_STOLEN since its semantics are such that the skb has been “consumed” or queued but certainly not “dropped”.
Last but not least, the TC_ACT_REDIRECT action is available for tc BPF programs as well. This allows to redirect the skb to the same or another device’s ingress or egress path together with the bpf_redirect() helper. Being able to inject the packet into another device’s ingress or egress direction allows for full flexibility in packet forwarding with BPF. There are no requirements on the target networking device other than being a networking device itself; there is no need to run another instance of cls_bpf on the target device or other such restrictions.
tc BPF FAQ
This section contains a few miscellaneous question and answer pairs related to tc BPF programs that are asked from time to time.
Question: What about act_bpf as a tc action module, is it still relevant?

Answer: Not really. Although cls_bpf and act_bpf share the same functionality for tc BPF programs, cls_bpf is more flexible since it is a proper superset of act_bpf. The way tc works is that tc actions need to be attached to tc classifiers. In order to achieve the same flexibility as cls_bpf, act_bpf would need to be attached to the cls_matchall classifier. As the name says, this will match on every packet in order to pass them through for attached tc action processing. For act_bpf, this will result in less efficient packet processing than using cls_bpf in direct-action mode directly. If act_bpf is used in a setting with other classifiers than cls_bpf or cls_matchall then this will perform even worse due to the nature of operation of tc classifiers. Meaning, if classifier A has a mismatch, then the packet is passed to classifier B, reparsing the packet, etc, thus in the typical case there will be linear processing where the packet would need to traverse N classifiers in the worst case to find a match and execute act_bpf on that. Therefore, act_bpf has never been largely relevant. Additionally, act_bpf does not provide a tc offloading interface either, compared to cls_bpf.
Question: Is it recommended to use cls_bpf not in direct-action mode?

Answer: No. The answer is similar to the one above in that this is otherwise unable to scale for more complex processing. tc BPF can already do everything needed by itself in an efficient manner and thus there is no need for anything other than direct-action mode.
Question: Is there any performance difference between offloaded cls_bpf and offloaded XDP?

Answer: No. Both are JITed through the same compiler in the kernel which handles the offloading to the SmartNIC, and the loading mechanism for both is very similar as well. Thus, the BPF program gets translated into the same target instruction set in order to be able to run on the NIC natively. The two tc BPF and XDP BPF program types have a differing set of features, so depending on the use case one might be picked over the other due to availability of certain helper functions in the offload case, for example.
Use cases for tc BPF
Some of the main use cases for tc BPF programs are presented in this subsection. Also here, the list is non-exhaustive and, given the programmability and efficiency of tc BPF, it can easily be tailored and integrated into orchestration systems in order to solve very specific use cases. While some use cases with XDP may overlap, tc BPF and XDP BPF are mostly complementary to each other and both can also be used at the same time, or one over the other depending on which is most suitable for the given problem to solve.
Policy enforcement for containers
One application which tc BPF programs are suitable for is to implement policy enforcement, custom firewalling or similar security measures for containers or pods, respectively. In the conventional case, container isolation is implemented through network namespaces with veth networking devices connecting the host’s initial namespace with the dedicated container’s namespace. Since one end of the veth pair has been moved into the container’s namespace whereas the other end remains in the initial namespace of the host, all network traffic from the container has to pass through the host-facing veth device allowing for attaching tc BPF programs on the tc ingress and egress hook of the veth. Network traffic going into the container will pass through the host-facing veth’s tc egress hook whereas network traffic coming from the container will pass through the host-facing veth’s tc ingress hook.
For virtual devices like veth devices, XDP is unsuitable in this case since the kernel operates solely on a skb here and generic XDP has a few limitations where it does not operate with cloned skb’s. The latter is heavily used from the TCP/IP stack in order to hold data segments for retransmission where the generic XDP hook would simply get bypassed instead. Moreover, generic XDP needs to linearize the entire skb resulting in heavily degraded performance. tc BPF on the other hand is more flexible as it specializes on the skb input context case and thus does not need to cope with the limitations from generic XDP.
Forwarding and load-balancing
The forwarding and load-balancing use case is quite similar to XDP, although slightly more targeted towards east-west container workloads rather than north-south traffic (though both technologies can be used in either case). Since XDP is only available on the ingress side, tc BPF programs allow for further use cases that apply in particular on egress, for example, container based traffic can already be NATed and load-balanced on the egress side through BPF out of the initial namespace such that this is done transparently to the container itself. Egress traffic is already based on the sk_buff structure due to the nature of the kernel’s networking stack, so packet rewrites and redirects are suitable out of tc BPF. By utilizing the bpf_redirect() helper function, BPF can take over the forwarding logic to push the packet either into the ingress or egress path of another networking device. Thus, any bridge-like devices become unnecessary to use as well by utilizing tc BPF as forwarding fabric.
Flow sampling, monitoring
Like in the XDP case, flow sampling and monitoring can be realized through a high-performance lockless per-CPU memory mapped perf ring buffer where the BPF program is able to push custom data, the full or truncated packet contents, or both up to a user space application. From the tc BPF program this is realized through the bpf_skb_event_output() BPF helper function which has the same function signature and semantics as bpf_xdp_event_output(). Given tc BPF programs can be attached to ingress and egress as opposed to only ingress in the XDP BPF case, and the two tc hooks are at the lowest layer in the (generic) networking stack, this allows for bidirectional monitoring of all network traffic from a particular node. This might be somewhat related to the cBPF case which tcpdump and Wireshark make use of, though without having to clone the skb and with being a lot more flexible in terms of programmability where, for example, BPF can already perform in-kernel aggregation rather than pushing everything up to user space, as well as add custom annotations for packets pushed into the ring buffer. The latter is also heavily used in Cilium where packet drops can be further annotated to correlate container labels and reasons for why a given packet had to be dropped (such as due to policy violation) in order to provide a richer context.
Packet scheduler pre-processing
The sch_clsact’s egress hook, which is called sch_handle_egress(), runs right before taking the kernel’s qdisc root lock, thus tc BPF programs can be utilized to perform all the heavy lifting packet classification and mangling before the packet is transmitted into a real full blown qdisc such as sch_htb. This type of interaction of sch_clsact with a real qdisc like sch_htb coming later in the transmission phase allows to reduce the lock contention on transmission since sch_clsact’s egress hook is executed without taking locks.
One concrete example user of tc BPF but also XDP BPF programs is Cilium. Cilium is open source software for transparently securing the network connectivity between application services deployed using Linux container management platforms like Docker and Kubernetes and operates at Layer 3/4 as well as Layer 7. At the heart of Cilium operates BPF in order to implement the policy enforcement as well as load balancing and monitoring.
Video: https://youtu.be/ilKlmTDdFgk
Github: https://github.com/cilium/cilium
Driver support
Since tc BPF programs are triggered from the kernel’s networking stack and not directly out of the driver, they do not require any extra driver modification and therefore can run on any networking device. The only exception listed below is for offloading tc BPF programs to the NIC.
Drivers supporting offloaded tc BPF
Netronome
nfp [2]
Note
Examples for writing and loading tc BPF programs are included in the Development Tools section under the respective tools.