Development Tools
Current user space tooling, introspection facilities and kernel control knobs around BPF are discussed in this section.
Note
The tooling and infrastructure around BPF is still rapidly evolving and thus may not provide a complete picture of all available tools.
Development Environment
A step by step guide for setting up a development environment for BPF can be found below for both Fedora and Ubuntu. This will guide you through building, installing and testing a development kernel as well as building and installing iproute2.
The step of manually building iproute2 and Linux kernel is usually not necessary given that major distributions already ship recent enough kernels by default, but would be needed for testing bleeding edge versions or contributing BPF patches to iproute2 and to the Linux kernel, respectively. Similarly, for debugging and introspection purposes building bpftool is optional, but recommended.
The following applies to Fedora 25 or later:
$ sudo dnf install -y git gcc ncurses-devel elfutils-libelf-devel bc \
openssl-devel libcap-devel clang llvm graphviz bison flex glibc-static
Note
If you are running some other Fedora derivative and dnf
is missing,
try using yum
instead.
The following applies to Ubuntu 17.04 or later:
$ sudo apt-get install -y make gcc libssl-dev bc libelf-dev libcap-dev \
clang gcc-multilib llvm libncurses5-dev git pkg-config libmnl-dev bison flex \
graphviz
The following applies to openSUSE Tumbleweed and openSUSE Leap 15.0 or later:
$ sudo zypper install -y git gcc ncurses-devel libelf-devel bc libopenssl-devel \
libcap-devel clang llvm graphviz bison flex glibc-devel-static
Compiling the Kernel
Development of new BPF features for the Linux kernel happens inside the net-next
git tree, latest BPF fixes in the net
tree. The following command will obtain
the kernel source for the net-next
tree through git:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
If the git commit history is not of interest, then --depth 1
will clone the
tree much faster by truncating the git history only to the most recent commit.
In case the net
tree is of interest, it can be cloned from this url:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
There are dozens of tutorials in the Internet on how to build Linux kernels, one good resource is the Kernel Newbies website (https://kernelnewbies.org/KernelBuild) that can be followed with one of the two git trees mentioned above.
Make sure that the generated .config
file contains the following CONFIG_*
entries for running BPF. These entries are also needed for Cilium.
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_CLS_ACT=y
CONFIG_BPF_JIT=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m
Some of the entries cannot be adjusted through make menuconfig
. For example,
CONFIG_HAVE_EBPF_JIT
is selected automatically if a given architecture does
come with an eBPF JIT. In this specific case, CONFIG_HAVE_EBPF_JIT
is optional
but highly recommended. An architecture not having an eBPF JIT compiler will need
to fall back to the in-kernel interpreter with the cost of being less efficient
executing BPF instructions.
Verifying the Setup
After you have booted into the newly compiled kernel, navigate to the BPF selftest suite in order to test BPF functionality (current working directory points to the root of the cloned git tree):
$ cd tools/testing/selftests/bpf/
$ make
$ sudo ./test_verifier
The verifier tests print out all the current checks being performed. The summary at the end of running all tests will dump information of test successes and failures:
Summary: 847 PASSED, 0 SKIPPED, 0 FAILED
Note
For kernel releases 4.16+ the BPF selftest has a dependency on LLVM 6.0+ caused by the BPF function calls which do not need to be inlined anymore. See section BPF to BPF Calls or the cover letter mail from the kernel patch (https://lwn.net/Articles/741773/) for more information. Not every BPF program has a dependency on LLVM 6.0+ if it does not use this new feature. If your distribution does not provide LLVM 6.0+ you may compile it by following the instruction in the LLVM section.
In order to run through all BPF selftests, the following command is needed:
$ sudo make run_tests
If you see any failures, please contact us on Cilium Slack with the full test output.
Compiling iproute2
Similar to the net
(fixes only) and net-next
(new features) kernel trees,
the iproute2 git tree has two branches, namely master
and net-next
. The
master
branch is based on the net
tree and the net-next
branch is
based against the net-next
kernel tree. This is necessary, so that changes
in header files can be synchronized in the iproute2 tree.
In order to clone the iproute2 master
branch, the following command can
be used:
$ git clone https://git.kernel.org/pub/scm/network/iproute2/iproute2.git
Similarly, to clone into mentioned net-next
branch of iproute2, run the
following:
$ git clone -b net-next https://git.kernel.org/pub/scm/network/iproute2/iproute2.git
After that, proceed with the build and installation:
$ cd iproute2/
$ ./configure --prefix=/usr
TC schedulers
ATM no
libc has setns: yes
SELinux support: yes
ELF support: yes
libmnl support: no
Berkeley DB: no
docs: latex: no
WARNING: no docs can be built from LaTeX files
sgml2html: no
WARNING: no HTML docs can be built from SGML
$ make
[...]
$ sudo make install
Ensure that the configure
script shows ELF support: yes
, so that iproute2
can process ELF files from LLVM’s BPF back end. libelf was listed in the instructions
for installing the dependencies in case of Fedora and Ubuntu earlier.
Compiling bpftool
bpftool is an essential tool around debugging and introspection of BPF programs
and maps. It is part of the kernel tree and available under tools/bpf/bpftool/
.
Make sure to have cloned either the net
or net-next
kernel tree as described
earlier. In order to build and install bpftool, the following steps are required:
$ cd <kernel-tree>/tools/bpf/bpftool/
$ make
Auto-detecting system features:
... libbfd: [ on ]
... disassembler-four-args: [ OFF ]
CC xlated_dumper.o
CC prog.o
CC common.o
CC cgroup.o
CC main.o
CC json_writer.o
CC cfg.o
CC map.o
CC jit_disasm.o
CC disasm.o
make[1]: Entering directory '/home/foo/trees/net/tools/lib/bpf'
Auto-detecting system features:
... libelf: [ on ]
... bpf: [ on ]
CC libbpf.o
CC bpf.o
CC nlattr.o
LD libbpf-in.o
LINK libbpf.a
make[1]: Leaving directory '/home/foo/trees/bpf/tools/lib/bpf'
LINK bpftool
$ sudo make install
LLVM
LLVM is currently the only compiler suite providing a BPF back end. gcc does not support BPF at this point.
The BPF back end was merged into LLVM’s 3.7 release. Major distributions enable the BPF back end by default when they package LLVM, therefore installing clang and llvm is sufficient on most recent distributions to start compiling C into BPF object files.
The typical workflow is that BPF programs are written in C, compiled by LLVM into object / ELF files, which are parsed by user space BPF ELF loaders (such as iproute2 or others), and pushed into the kernel through the BPF system call. The kernel verifies the BPF instructions and JITs them, returning a new file descriptor for the program, which then can be attached to a subsystem (e.g. networking). If supported, the subsystem could then further offload the BPF program to hardware (e.g. NIC).
For LLVM, BPF target support can be checked, for example, through the following:
$ llc --version
LLVM (http://llvm.org/):
LLVM version 3.8.1
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
Registered Targets:
[...]
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
[...]
By default, the bpf
target uses the endianness of the CPU it compiles on,
meaning that if the CPU’s endianness is little endian, the program is represented
in little endian format as well, and if the CPU’s endianness is big endian,
the program is represented in big endian. This also matches the runtime behavior
of BPF, which is generic and uses the CPU’s endianness it runs on in order
to not disadvantage architectures in any of the format.
For cross-compilation, the two targets bpfeb
and bpfel
were introduced,
thanks to that BPF programs can be compiled on a node running in one endianness
(e.g. little endian on x86) and run on a node in another endianness format (e.g.
big endian on arm). Note that the front end (clang) needs to run in the target
endianness as well.
Using bpf
as a target is the preferred way in situations where no mixture of
endianness applies. For example, compilation on x86_64
results in the same
output for the targets bpf
and bpfel
due to being little endian, therefore
scripts triggering a compilation also do not have to be endian aware.
A minimal, stand-alone XDP drop program might look like the following example
(xdp-example.c
):
#include <linux/bpf.h>
#ifndef __section
# define __section(NAME) \
__attribute__((section(NAME), used))
#endif
__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
return XDP_DROP;
}
char __license[] __section("license") = "GPL";
It can then be compiled and loaded into the kernel as follows:
$ clang -O2 -Wall --target=bpf -c xdp-example.c -o xdp-example.o
# ip link set dev em1 xdp obj xdp-example.o
Note
Attaching an XDP BPF program to a network device as above requires Linux 4.11 with a device that supports XDP, or Linux 4.12 or later.
For the generated object file LLVM (>= 3.9) uses the official BPF machine value,
that is, EM_BPF
(decimal: 247
/ hex: 0xf7
). In this example, the program
has been compiled with bpf
target under x86_64
, therefore LSB
(as opposed
to MSB
) is shown regarding endianness:
$ file xdp-example.o
xdp-example.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped
readelf -a xdp-example.o
will dump further information about the ELF file, which can
sometimes be useful for introspecting generated section headers, relocation entries
and the symbol table.
In the unlikely case where clang and LLVM need to be compiled from scratch, the following commands can be used:
$ git clone https://github.com/llvm/llvm-project.git
$ cd llvm-project
$ mkdir build
$ cd build
$ cmake -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD="BPF;X86" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=OFF -G "Unix Makefiles" ../llvm
$ make -j $(getconf _NPROCESSORS_ONLN)
$ ./bin/llc --version
LLVM (http://llvm.org/):
LLVM version x.y.zsvn
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
Registered Targets:
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
$ export PATH=$PWD/bin:$PATH # add to ~/.bashrc
Make sure that --version
mentions Optimized build.
, otherwise the
compilation time for programs when having LLVM in debugging mode will
significantly increase (e.g. by 10x or more).
For debugging, clang can generate the assembler output as follows:
$ clang -O2 -S -Wall --target=bpf -c xdp-example.c -o xdp-example.S
$ cat xdp-example.S
.text
.section prog,"ax",@progbits
.globl xdp_drop
.p2align 3
xdp_drop: # @xdp_drop
# BB#0:
r0 = 1
exit
.section license,"aw",@progbits
.globl __license # @__license
__license:
.asciz "GPL"
Starting from LLVM’s release 6.0, there is also assembler parser support. You can program using BPF assembler directly, then use llvm-mc to assemble it into an object file. For example, you can assemble the xdp-example.S listed above back into object file using:
$ llvm-mc -triple bpf -filetype=obj -o xdp-example.o xdp-example.S
Furthermore, more recent LLVM versions (>= 4.0) can also store debugging
information in dwarf format into the object file. This can be done through
the usual workflow by adding -g
for compilation.
$ clang -O2 -g -Wall --target=bpf -c xdp-example.c -o xdp-example.o
$ llvm-objdump -S --no-show-raw-insn xdp-example.o
xdp-example.o: file format ELF64-BPF
Disassembly of section prog:
xdp_drop:
; {
0: r0 = 1
; return XDP_DROP;
1: exit
The llvm-objdump
tool can then annotate the assembler output with the
original C code used in the compilation. The trivial example in this case
does not contain much C code, however, the line numbers shown as 0:
and 1:
correspond directly to the kernel’s verifier log.
This means that in case BPF programs get rejected by the verifier, llvm-objdump
can help to correlate the instructions back to the original C code, which is
highly useful for analysis.
# ip link set dev em1 xdp obj xdp-example.o verb
Prog section 'prog' loaded (5)!
- Type: 6
- Instructions: 2 (0 over limit)
- License: GPL
Verifier analysis:
0: (b7) r0 = 1
1: (95) exit
processed 2 insns
As it can be seen in the verifier analysis, the llvm-objdump
output dumps
the same BPF assembler code as the kernel.
Leaving out the --no-show-raw-insn
option will also dump the raw
struct bpf_insn
as hex in front of the assembly:
$ llvm-objdump -S xdp-example.o
xdp-example.o: file format ELF64-BPF
Disassembly of section prog:
xdp_drop:
; {
0: b7 00 00 00 01 00 00 00 r0 = 1
; return foo();
1: 95 00 00 00 00 00 00 00 exit
For LLVM IR debugging, the compilation process for BPF can be split into
two steps, generating a binary LLVM IR intermediate file xdp-example.bc
, which
can later on be passed to llc:
$ clang -O2 -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -filetype=obj -o xdp-example.o
The generated LLVM IR can also be dumped in human readable format through:
$ clang -O2 -Wall -emit-llvm -S -c xdp-example.c -o -
LLVM is able to attach debug information such as the description of used data types in the program to the generated BPF object file. By default this is in DWARF format.
A heavily simplified version used by BPF is called BTF (BPF Type Format). The resulting DWARF can be converted into BTF and is later on loaded into the kernel through BPF object loaders. The kernel will then verify the BTF data for correctness and keeps track of the data types the BTF data is containing.
BPF maps can then be annotated with key and value types out of the BTF data such that a later dump of the map exports the map data along with the related type information. This allows for better introspection, debugging and value pretty printing. Note that BTF data is a generic debugging data format and as such any DWARF to BTF converted data can be loaded (e.g. kernel’s vmlinux DWARF data could be converted to BTF and loaded). Latter is in particular useful for BPF tracing in the future.
In order to generate BTF from DWARF debugging information, elfutils (>= 0.173)
is needed. If that is not available, then adding the -mattr=dwarfris
option
to the llc
command is required during compilation:
$ llc -march=bpf -mattr=help |& grep dwarfris
dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
[...]
The reason using -mattr=dwarfris
is because the flag dwarfris
(dwarf
relocation in section
) disables DWARF cross-section relocations between DWARF
and the ELF’s symbol table since libdw does not have proper BPF relocation
support, and therefore tools like pahole
would otherwise not be able to
properly dump structures from the object.
elfutils (>= 0.173) implements proper BPF relocation support and therefore
the same can be achieved without the -mattr=dwarfris
option. Dumping
the structures from the object file could be done from either DWARF or BTF
information. pahole
uses the LLVM emitted DWARF information at this
point, however, future pahole
versions could rely on BTF if available.
For converting DWARF into BTF, a recent pahole version (>= 1.12) is required. A recent pahole version can also be obtained from its official git repository if not available from one of the distribution packages:
$ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
pahole
comes with the option -J
to convert DWARF into BTF from an
object file. pahole
can be probed for BTF support as follows (note that
the llvm-objcopy
tool is required for pahole
as well, so check its
presence, too):
$ pahole --help | grep BTF
-J, --btf_encode Encode as BTF
Generating debugging information also requires the front end to generate
source level debug information by passing -g
to the clang
command
line. Note that -g
is needed independently of whether llc
’s
dwarfris
option is used. Full example for generating the object file:
$ clang -O2 -g -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -mattr=dwarfris -filetype=obj -o xdp-example.o
Alternatively, by using clang only to build a BPF program with debugging information (again, the dwarfris flag can be omitted when having proper elfutils version):
$ clang --target=bpf -O2 -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
After successful compilation pahole
can be used to properly dump structures
of the BPF program based on the DWARF information:
$ pahole xdp-example.o
struct xdp_md {
__u32 data; /* 0 4 */
__u32 data_end; /* 4 4 */
__u32 data_meta; /* 8 4 */
/* size: 12, cachelines: 1, members: 3 */
/* last cacheline: 12 bytes */
};
Through the option -J
pahole
can eventually generate the BTF from
DWARF. In the object file DWARF data will still be retained alongside the
newly added BTF data. Full clang
and pahole
example combined:
$ clang --target=bpf -O2 -Wall -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
$ pahole -J xdp-example.o
The presence of a .BTF
section can be seen through readelf
tool:
$ readelf -a xdp-example.o
[...]
[18] .BTF PROGBITS 0000000000000000 00000671
[...]
BPF loaders such as iproute2 will detect and load the BTF section, so that BPF maps can be annotated with type information.
LLVM by default uses the BPF base instruction set for generating code in order to make sure that the generated object file can also be loaded with older kernels such as long-term stable kernels (e.g. 4.9+).
However, LLVM has a -mcpu
selector for the BPF back end in order to
select different versions of the BPF instruction set, namely instruction
set extensions on top of the BPF base instruction set in order to generate
more efficient and smaller code.
Available -mcpu
options can be queried through:
$ llc -march bpf -mcpu=help
Available CPUs for this target:
generic - Select the generic processor.
probe - Select the probe processor.
v1 - Select the v1 processor.
v2 - Select the v2 processor.
[...]
The generic
processor is the default processor, which is also the
base instruction set v1
of BPF. Options v1
and v2
are typically
useful in an environment where the BPF program is being cross compiled
and the target host where the program is loaded differs from the one
where it is compiled (and thus available BPF kernel features might differ
as well).
The recommended -mcpu
option which is also used by Cilium internally is
-mcpu=probe
! Here, the LLVM BPF back end queries the kernel for availability
of BPF instruction set extensions and when found available, LLVM will use
them for compiling the BPF program whenever appropriate.
A full command line example with llc’s -mcpu=probe
:
$ clang -O2 -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -mcpu=probe -filetype=obj -o xdp-example.o
Generally, LLVM IR generation is architecture independent. There are
however a few differences when using clang --target=bpf
versus
leaving --target=bpf
out and thus using clang’s default target which,
depending on the underlying architecture, might be x86_64
, arm64
or others.
Quoting from the kernel’s Documentation/bpf/bpf_devel_QA.txt
:
BPF programs may recursively include header file(s) with file scope inline assembly codes. The default target can handle this well, while bpf target may fail if bpf backend assembler does not understand these assembly codes, which is true in most cases.
When compiled without -g, additional elf sections, e.g.,
.eh_frame
and.rela.eh_frame
, may be present in the object file with default target, but not with bpf target.The default target may turn a C switch statement into a switch table lookup and jump operation. Since the switch table is placed in the global read-only section, the bpf program will fail to load. The bpf target does not support switch table optimization. The clang option
-fno-jump-tables
can be used to disable switch table generation.For clang
--target=bpf
, it is guaranteed that pointer or long / unsigned long types will always have a width of 64 bit, no matter whether underlying clang binary or default target (or kernel) is 32 bit. However, when native clang target is used, then it will compile these types based on the underlying architecture’s conventions, meaning in case of 32 bit architecture, pointer or long / unsigned long types e.g. in BPF context structure will have width of 32 bit while the BPF LLVM back end still operates in 64 bit.
The native target is mostly needed in tracing for the case of walking
the kernel’s struct pt_regs
that maps CPU registers, or other kernel
structures where CPU’s register width matters. In all other cases such
as networking, the use of clang --target=bpf
is the preferred choice.
Also, LLVM started to support 32-bit subregisters and BPF ALU32 instructions since
LLVM’s release 7.0. A new code generation attribute alu32
is added. When it is
enabled, LLVM will try to use 32-bit subregisters whenever possible, typically
when there are operations on 32-bit types. The associated ALU instructions with
32-bit subregisters will become ALU32 instructions. For example, for the
following sample code:
$ cat 32-bit-example.c
void cal(unsigned int *a, unsigned int *b, unsigned int *c)
{
unsigned int sum = *a + *b;
*c = sum;
}
At default code generation, the assembler will looks like:
$ clang --target=bpf -emit-llvm -S 32-bit-example.c
$ llc -march=bpf 32-bit-example.ll
$ cat 32-bit-example.s
cal:
r1 = *(u32 *)(r1 + 0)
r2 = *(u32 *)(r2 + 0)
r2 += r1
*(u32 *)(r3 + 0) = r2
exit
64-bit registers are used, hence the addition means 64-bit addition. Now, if you
enable the new 32-bit subregisters support by specifying -mattr=+alu32
, then
the assembler will looks like:
$ llc -march=bpf -mattr=+alu32 32-bit-example.ll
$ cat 32-bit-example.s
cal:
w1 = *(u32 *)(r1 + 0)
w2 = *(u32 *)(r2 + 0)
w2 += w1
*(u32 *)(r3 + 0) = w2
exit
w
register, meaning 32-bit subregister, will be used instead of 64-bit r
register.
Enable 32-bit subregisters might help reducing type extension instruction sequences. It could also help kernel eBPF JIT compiler for 32-bit architectures for which registers pairs are used to model the 64-bit eBPF registers and extra instructions are needed for manipulating the high 32-bit. Given read from 32-bit subregister is guaranteed to read from low 32-bit only even though write still needs to clear the high 32-bit, if the JIT compiler has known the definition of one register only has subregister reads, then instructions for setting the high 32-bit of the destination could be eliminated.
When writing C programs for BPF, there are a couple of pitfalls to be aware of, compared to usual application development with C. The following items describe some of the differences for the BPF model:
Everything needs to be inlined, there are no function calls (on older LLVM versions) or shared library calls available.
Shared libraries, etc cannot be used with BPF. However, common library code used in BPF programs can be placed into header files and included in the main programs. For example, Cilium makes heavy use of it (see
bpf/lib/
). However, this still allows for including header files, for example, from the kernel or other libraries and reuse their static inline functions or macros / definitions.Unless a recent kernel (4.16+) and LLVM (6.0+) is used where BPF to BPF function calls are supported, then LLVM needs to compile and inline the entire code into a flat sequence of BPF instructions for a given program section. In such case, best practice is to use an annotation like
__inline
for every library function as shown below. The use ofalways_inline
is recommended, since the compiler could still decide to uninline large functions that are only annotated asinline
.In case the latter happens, LLVM will generate a relocation entry into the ELF file, which BPF ELF loaders such as iproute2 cannot resolve and will thus produce an error since only BPF maps are valid relocation entries which loaders can process.
#include <linux/bpf.h> #ifndef __section # define __section(NAME) \ __attribute__((section(NAME), used)) #endif #ifndef __inline # define __inline \ inline __attribute__((always_inline)) #endif static __inline int foo(void) { return XDP_DROP; } __section("prog") int xdp_drop(struct xdp_md *ctx) { return foo(); } char __license[] __section("license") = "GPL";
Multiple programs can reside inside a single C file in different sections.
C programs for BPF make heavy use of section annotations. A C file is typically structured into 3 or more sections. BPF ELF loaders use these names to extract and prepare the relevant information in order to load the programs and maps through the bpf system call. For example, iproute2 uses
maps
andlicense
as default section name to find metadata needed for map creation and the license for the BPF program, respectively. On program creation time the latter is pushed into the kernel as well, and enables some of the helper functions which are exposed as GPL only in case the program also holds a GPL compatible license, for examplebpf_ktime_get_ns()
,bpf_probe_read()
and others.The remaining section names are specific for BPF program code, for example, the below code has been modified to contain two program sections,
ingress
andegress
. The toy example code demonstrates that both can share a map and common static inline helpers such as theaccount_data()
function.The
xdp-example.c
example has been modified to atc-example.c
example that can be loaded with tc and attached to a netdevice’s ingress and egress hook. It accounts the transferred bytes into a map calledacc_map
, which has two map slots, one for traffic accounted on the ingress hook, one on the egress hook.#include <linux/bpf.h> #include <linux/pkt_cls.h> #include <stdint.h> #include <iproute2/bpf_elf.h> #ifndef __section # define __section(NAME) \ __attribute__((section(NAME), used)) #endif #ifndef __inline # define __inline \ inline __attribute__((always_inline)) #endif #ifndef lock_xadd # define lock_xadd(ptr, val) \ ((void)__sync_fetch_and_add(ptr, val)) #endif #ifndef BPF_FUNC # define BPF_FUNC(NAME, ...) \ (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME #endif static void *BPF_FUNC(map_lookup_elem, void *map, const void *key); struct bpf_elf_map acc_map __section("maps") = { .type = BPF_MAP_TYPE_ARRAY, .size_key = sizeof(uint32_t), .size_value = sizeof(uint32_t), .pinning = PIN_GLOBAL_NS, .max_elem = 2, }; static __inline int account_data(struct __sk_buff *skb, uint32_t dir) { uint32_t *bytes; bytes = map_lookup_elem(&acc_map, &dir); if (bytes) lock_xadd(bytes, skb->len); return TC_ACT_OK; } __section("ingress") int tc_ingress(struct __sk_buff *skb) { return account_data(skb, 0); } __section("egress") int tc_egress(struct __sk_buff *skb) { return account_data(skb, 1); } char __license[] __section("license") = "GPL";
The example also demonstrates a couple of other things which are useful to be aware of when developing programs. The code includes kernel headers, standard C headers and an iproute2 specific header containing the definition of
struct bpf_elf_map
. iproute2 has a common BPF ELF loader and as such the definition ofstruct bpf_elf_map
is the very same for XDP and tc typed programs.A
struct bpf_elf_map
entry defines a map in the program and contains all relevant information (such as key / value size, etc) needed to generate a map which is used from the two BPF programs. The structure must be placed into themaps
section, so that the loader can find it. There can be multiple map declarations of this type with different variable names, but all must be annotated with__section("maps")
.The
struct bpf_elf_map
is specific to iproute2. Different BPF ELF loaders can have different formats, for example, the libbpf in the kernel source tree, which is mainly used byperf
, has a different specification. iproute2 guarantees backwards compatibility forstruct bpf_elf_map
. Cilium follows the iproute2 model.The example also demonstrates how BPF helper functions are mapped into the C code and being used. Here,
map_lookup_elem()
is defined by mapping this function into theBPF_FUNC_map_lookup_elem
enum value which is exposed as a helper inuapi/linux/bpf.h
. When the program is later loaded into the kernel, the verifier checks whether the passed arguments are of the expected type and re-points the helper call into a real function call. Moreover,map_lookup_elem()
also demonstrates how maps can be passed to BPF helper functions. Here,&acc_map
from themaps
section is passed as the first argument tomap_lookup_elem()
.Since the defined array map is global, the accounting needs to use an atomic operation, which is defined as
lock_xadd()
. LLVM maps__sync_fetch_and_add()
as a built-in function to the BPF atomic add instruction, that is,BPF_STX | BPF_XADD | BPF_W
for word sizes.Last but not least, the
struct bpf_elf_map
tells that the map is to be pinned asPIN_GLOBAL_NS
. This means that tc will pin the map into the BPF pseudo file system as a node. By default, it will be pinned to/sys/fs/bpf/tc/globals/acc_map
for the given example. Due to thePIN_GLOBAL_NS
, the map will be placed under/sys/fs/bpf/tc/globals/
.globals
acts as a global namespace that spans across object files. If the example usedPIN_OBJECT_NS
, then tc would create a directory that is local to the object file. For example, different C files with BPF code could have the sameacc_map
definition as above with aPIN_GLOBAL_NS
pinning. In that case, the map will be shared among BPF programs originating from various object files.PIN_NONE
would mean that the map is not placed into the BPF file system as a node, and as a result will not be accessible from user space after tc quits. It would also mean that tc creates two separate map instances for each program, since it cannot retrieve a previously pinned map under that name. Theacc_map
part from the mentioned path is the name of the map as specified in the source code.Thus, upon loading of the
ingress
program, tc will find that no such map exists in the BPF file system and creates a new one. On success, the map will also be pinned, so that when theegress
program is loaded through tc, it will find that such map already exists in the BPF file system and will reuse that for theegress
program. The loader also makes sure in case maps exist with the same name that also their properties (key / value size, etc) match.Just like tc can retrieve the same map, also third party applications can use the
BPF_OBJ_GET
command from the bpf system call in order to create a new file descriptor pointing to the same map instance, which can then be used to lookup / update / delete map elements.The code can be compiled and loaded via iproute2 as follows:
$ clang -O2 -Wall --target=bpf -c tc-example.c -o tc-example.o # tc qdisc add dev em1 clsact # tc filter add dev em1 ingress bpf da obj tc-example.o sec ingress # tc filter add dev em1 egress bpf da obj tc-example.o sec egress # tc filter show dev em1 ingress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[ingress] direct-action id 1 tag c5f7825e5dac396f # tc filter show dev em1 egress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[egress] direct-action id 2 tag b2fd5adc0f262714 # mount | grep bpf sysfs on /sys/fs/bpf type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel) bpf on /sys/fs/bpf type bpf (rw,relatime,mode=0700) # tree /sys/fs/bpf/ /sys/fs/bpf/ +-- ip -> /sys/fs/bpf/tc/ +-- tc | +-- globals | +-- acc_map +-- xdp -> /sys/fs/bpf/tc/ 4 directories, 1 fileAs soon as packets pass the
em1
device, counters from the BPF map will be increased.
There are no global variables allowed.
For the reasons already mentioned in point 1, BPF cannot have global variables as often used in normal C programs.
However, there is a work-around in that the program can simply use a BPF map of type
BPF_MAP_TYPE_PERCPU_ARRAY
with just a single slot of arbitrary value size. This works, because during execution, BPF programs are guaranteed to never get preempted by the kernel and therefore can use the single map entry as a scratch buffer for temporary data, for example, to extend beyond the stack limitation. This also functions across tail calls, since it has the same guarantees with regards to preemption.Otherwise, for holding state across multiple BPF program runs, normal BPF maps can be used.
There are no const strings or arrays allowed.
Defining
const
strings or other arrays in the BPF C program does not work for the same reasons as pointed out in sections 1 and 3, which is, that relocation entries will be generated in the ELF file which will be rejected by loaders due to not being part of the ABI towards loaders (loaders also cannot fix up such entries as it would require large rewrites of the already compiled BPF sequence).In the future, LLVM might detect these occurrences and early throw an error to the user.
Helper functions such as
trace_printk()
can be worked around as follows:static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...); #ifndef printk # define printk(fmt, ...) \ ({ \ char ____fmt[] = fmt; \ trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \ }) #endifThe program can then use the macro naturally like
printk("skb len:%u\n", skb->len);
. The output will then be written to the trace pipe.tc exec bpf dbg
can be used to retrieve the messages from there.The use of the
trace_printk()
helper function has a couple of disadvantages and thus is not recommended for production usage. Constant strings like the"skb len:%u\n"
need to be loaded into the BPF stack each time the helper function is called, but also BPF helper functions are limited to a maximum of 5 arguments. This leaves room for only 3 additional variables which can be passed for dumping.Therefore, despite being helpful for quick debugging, it is recommended (for networking programs) to use the
skb_event_output()
or thexdp_event_output()
helper, respectively. They allow for passing custom structs from the BPF program to the perf event ring buffer along with an optional packet sample. For example, Cilium’s monitor makes use of these helpers in order to implement a debugging framework, notifications for network policy violations, etc. These helpers pass the data through a lockless memory mapped per-CPUperf
ring buffer, and is thus significantly faster thantrace_printk()
.
Use of LLVM built-in functions for memset()/memcpy()/memmove()/memcmp().
Since BPF programs cannot perform any function calls other than those to BPF helpers, common library code needs to be implemented as inline functions. In addition, also LLVM provides some built-ins that the programs can use for constant sizes (here:
n
) which will then always get inlined:#ifndef memset # define memset(dest, chr, n) __builtin_memset((dest), (chr), (n)) #endif #ifndef memcpy # define memcpy(dest, src, n) __builtin_memcpy((dest), (src), (n)) #endif #ifndef memmove # define memmove(dest, src, n) __builtin_memmove((dest), (src), (n)) #endifThe
memcmp()
built-in had some corner cases where inlining did not take place due to an LLVM issue in the back end, and is therefore not recommended to be used until the issue is fixed.
There are no loops available (yet).
The BPF verifier in the kernel checks that a BPF program does not contain loops by performing a depth first search of all possible program paths besides other control flow graph validations. The purpose is to make sure that the program is always guaranteed to terminate.
A very limited form of looping is available for constant upper loop bounds by using
#pragma unroll
directive. Example code that is compiled to BPF:#pragma unroll for (i = 0; i < IPV6_MAX_HEADERS; i++) { switch (nh) { case NEXTHDR_NONE: return DROP_INVALID_EXTHDR; case NEXTHDR_FRAGMENT: return DROP_FRAG_NOSUPPORT; case NEXTHDR_HOP: case NEXTHDR_ROUTING: case NEXTHDR_AUTH: case NEXTHDR_DEST: if (skb_load_bytes(skb, l3_off + len, &opthdr, sizeof(opthdr)) < 0) return DROP_INVALID; nh = opthdr.nexthdr; if (nh == NEXTHDR_AUTH) len += ipv6_authlen(&opthdr); else len += ipv6_optlen(&opthdr); break; default: *nexthdr = nh; return len; } }Another possibility is to use tail calls by calling into the same program again and using a
BPF_MAP_TYPE_PERCPU_ARRAY
map for having a local scratch space. While being dynamic, this form of looping however is limited to a maximum of 34 iterations (the initial program, plus 33 iterations from the tail calls).In the future, BPF may have some native, but limited form of implementing loops.
Partitioning programs with tail calls.
Tail calls provide the flexibility to atomically alter program behavior during runtime by jumping from one BPF program into another. In order to select the next program, tail calls make use of program array maps (
BPF_MAP_TYPE_PROG_ARRAY
), and pass the map as well as the index to the next program to jump to. There is no return to the old program after the jump has been performed, and in case there was no program present at the given map index, then execution continues on the original program.For example, this can be used to implement various stages of a parser, where such stages could be updated with new parsing features during runtime.
Another use case are event notifications, for example, Cilium can opt in packet drop notifications during runtime, where the
skb_event_output()
call is located inside the tail called program. Thus, during normal operations, the fall-through path will always be executed unless a program is added to the related map index, where the program then prepares the metadata and triggers the event notification to a user space daemon.Program array maps are quite flexible, enabling also individual actions to be implemented for programs located in each map index. For example, the root program attached to XDP or tc could perform an initial tail call to index 0 of the program array map, performing traffic sampling, then jumping to index 1 of the program array map, where firewalling policy is applied and the packet either dropped or further processed in index 2 of the program array map, where it is mangled and sent out of an interface again. Jumps in the program array map can, of course, be arbitrary. The kernel will eventually execute the fall-through path when the maximum tail call limit has been reached.
Minimal example extract of using tail calls:
[...] #ifndef __stringify # define __stringify(X) #X #endif #ifndef __section # define __section(NAME) \ __attribute__((section(NAME), used)) #endif #ifndef __section_tail # define __section_tail(ID, KEY) \ __section(__stringify(ID) "/" __stringify(KEY)) #endif #ifndef BPF_FUNC # define BPF_FUNC(NAME, ...) \ (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME #endif #define BPF_JMP_MAP_ID 1 static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map, uint32_t index); struct bpf_elf_map jmp_map __section("maps") = { .type = BPF_MAP_TYPE_PROG_ARRAY, .id = BPF_JMP_MAP_ID, .size_key = sizeof(uint32_t), .size_value = sizeof(uint32_t), .pinning = PIN_GLOBAL_NS, .max_elem = 1, }; __section_tail(BPF_JMP_MAP_ID, 0) int looper(struct __sk_buff *skb) { printk("skb cb: %u\n", skb->cb[0]++); tail_call(skb, &jmp_map, 0); return TC_ACT_OK; } __section("prog") int entry(struct __sk_buff *skb) { skb->cb[0] = 0; tail_call(skb, &jmp_map, 0); return TC_ACT_OK; } char __license[] __section("license") = "GPL";When loading this toy program, tc will create the program array and pin it to the BPF file system in the global namespace under
jmp_map
. Also, the BPF ELF loader in iproute2 will also recognize sections that are marked as__section_tail()
. The providedid
instruct bpf_elf_map
will be matched against the id marker in the__section_tail()
, that is,JMP_MAP_ID
, and the program therefore loaded at the user specified program array map index, which is0
in this example. As a result, all provided tail call sections will be populated by the iproute2 loader to the corresponding maps. This mechanism is not specific to tc, but can be applied with any other BPF program type that iproute2 supports (such as XDP, lwt).The generated elf contains section headers describing the map id and the entry within that map:
$ llvm-objdump -S --no-show-raw-insn prog_array.o | less prog_array.o: file format ELF64-BPF Disassembly of section 1/0: looper: 0: r6 = r1 1: r2 = *(u32 *)(r6 + 48) 2: r1 = r2 3: r1 += 1 4: *(u32 *)(r6 + 48) = r1 5: r1 = 0 ll 7: call -1 8: r1 = r6 9: r2 = 0 ll 11: r3 = 0 12: call 12 13: r0 = 0 14: exit Disassembly of section prog: entry: 0: r2 = 0 1: *(u32 *)(r1 + 48) = r2 2: r2 = 0 ll 4: r3 = 0 5: call 12 6: r0 = 0 7: exiIn this case, the
section 1/0
indicates that thelooper()
function resides in the map id1
at position0
.The pinned map can be retrieved by a user space applications (e.g. Cilium daemon), but also by tc itself in order to update the map with new programs. Updates happen atomically, the initial entry programs that are triggered first from the various subsystems are also updated atomically.
Example for tc to perform tail call map updates:
# tc exec bpf graft m:globals/jmp_map key 0 obj new.o sec fooIn case iproute2 would update the pinned program array, the
graft
command can be used. By pointing it toglobals/jmp_map
, tc will update the map at index / key0
with a new program residing in the object filenew.o
under sectionfoo
.
Limited stack space of maximum 512 bytes.
Stack space in BPF programs is limited to only 512 bytes, which needs to be taken into careful consideration when implementing BPF programs in C. However, as mentioned earlier in point 3, a
BPF_MAP_TYPE_PERCPU_ARRAY
map with a single entry can be used in order to enlarge scratch buffer space.
Use of BPF inline assembly possible.
LLVM 6.0 or later allows use of inline assembly for BPF for the rare cases where it might be needed. The following (nonsense) toy example shows a 64 bit atomic add. Due to lack of documentation, LLVM source code in
lib/Target/BPF/BPFInstrInfo.td
as well astest/CodeGen/BPF/
might be helpful for providing some additional examples. Test code:#include <linux/bpf.h> #ifndef __section # define __section(NAME) \ __attribute__((section(NAME), used)) #endif __section("prog") int xdp_test(struct xdp_md *ctx) { __u64 a = 2, b = 3, *c = &a; /* just a toy xadd example to show the syntax */ asm volatile("lock *(u64 *)(%0+0) += %1" : "=r"(c) : "r"(b), "0"(c)); return a; } char __license[] __section("license") = "GPL";The above program is compiled into the following sequence of BPF instructions:
Verifier analysis: 0: (b7) r1 = 2 1: (7b) *(u64 *)(r10 -8) = r1 2: (b7) r1 = 3 3: (bf) r2 = r10 4: (07) r2 += -8 5: (db) lock *(u64 *)(r2 +0) += r1 6: (79) r0 = *(u64 *)(r10 -8) 7: (95) exit processed 8 insns (limit 131072), stack depth 8
Remove struct padding with aligning members by using #pragma pack.
In modern compilers, data structures are aligned by default to access memory efficiently. Structure members are packed to memory addresses and padding is added for the proper alignment with the processor word size (e.g. 8-byte for 64-bit processors, 4-byte for 32-bit processors). Because of this, the size of struct may often grow larger than expected.
struct called_info { u64 start; // 8-byte u64 end; // 8-byte u32 sector; // 4-byte }; // size of 20-byte ? printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte // Actual compiled composition of struct called_info // 0x0(0) 0x8(8) // ↓________________________↓ // | start (8) | // |________________________| // | end (8) | // |________________________| // | sector(4) | PADDING | <= address aligned to 8 // |____________|___________| with 4-byte PADDING.The BPF verifier in the kernel checks the stack boundary that a BPF program does not access outside of boundary or uninitialized stack area. Using struct with the padding as a map value, will cause
invalid indirect read from stack
failure onbpf_prog_load()
.Example code:
struct called_info { u64 start; u64 end; u32 sector; }; struct bpf_map_def SEC("maps") called_info_map = { .type = BPF_MAP_TYPE_HASH, .key_size = sizeof(long), .value_size = sizeof(struct called_info), .max_entries = 4096, }; SEC("kprobe/submit_bio") int submit_bio_entry(struct pt_regs *ctx) { char fmt[] = "submit_bio(bio=0x%lx) called: %llu\n"; u64 start_time = bpf_ktime_get_ns(); long bio_ptr = PT_REGS_PARM1(ctx); struct called_info called_info = { .start = start_time, .end = 0, .sector = 0 }; bpf_map_update_elem(&called_info_map, &bio_ptr, &called_info, BPF_ANY); bpf_trace_printk(fmt, sizeof(fmt), bio_ptr, start_time); return 0; }Corresponding output on
bpf_load_program()
:bpf_load_program() err=13 0: (bf) r6 = r1 ... 19: (b7) r1 = 0 20: (7b) *(u64 *)(r10 -72) = r1 21: (7b) *(u64 *)(r10 -80) = r7 22: (63) *(u32 *)(r10 -64) = r1 ... 30: (85) call bpf_map_update_elem#2 invalid indirect read from stack off -80+20 size 24At
bpf_prog_load()
, an eBPF verifierbpf_check()
is called, and it’ll check stack boundary by callingcheck_func_arg() -> check_stack_boundary()
. From the upper error shows,struct called_info
is compiled to 24-byte size, and the message says reading a data from +20 is an invalid indirect read. And as we discussed earlier, the address 0x14(20) is the place where PADDING is.// Actual compiled composition of struct called_info // 0x10(16) 0x14(20) 0x18(24) // ↓____________↓___________↓ // | sector(4) | PADDING | <= address aligned to 8 // |____________|___________| with 4-byte PADDING.The
check_stack_boundary()
internally loops through the everyaccess_size
(24) byte from the start pointer to make sure that it’s within stack boundary and all elements of the stack are initialized. Since the padding isn’t supposed to be used, it gets the ‘invalid indirect read from stack’ failure. To avoid this kind of failure, removing the padding from the struct is necessary.Removing the padding by using
#pragma pack(n)
directive:#pragma pack(4) struct called_info { u64 start; // 8-byte u64 end; // 8-byte u32 sector; // 4-byte }; // size of 20-byte ? printf("size of %d-byte\n", sizeof(struct called_info)); // size of 20-byte // Actual compiled composition of packed struct called_info // 0x0(0) 0x8(8) // ↓________________________↓ // | start (8) | // |________________________| // | end (8) | // |________________________| // | sector(4) | <= address aligned to 4 // |____________| with no PADDING.By locating
#pragma pack(4)
before ofstruct called_info
, the compiler will align members of a struct to the least of 4-byte and their natural alignment. As you can see, the size ofstruct called_info
has been shrunk to 20-byte and the padding no longer exists.But, removing the padding has downsides too. For example, the compiler will generate less optimized code. Since we’ve removed the padding, processors will conduct unaligned access to the structure and this might lead to performance degradation. And also, unaligned access might get rejected by verifier on some architectures.
However, there is a way to avoid downsides of packed structure. By simply adding the explicit padding
u32 pad
member at the end will resolve the same problem without packing of the structure.struct called_info { u64 start; // 8-byte u64 end; // 8-byte u32 sector; // 4-byte u32 pad; // 4-byte }; // size of 24-byte ? printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte // Actual compiled composition of struct called_info with explicit padding // 0x0(0) 0x8(8) // ↓________________________↓ // | start (8) | // |________________________| // | end (8) | // |________________________| // | sector(4) | pad (4) | <= address aligned to 8 // |____________|___________| with explicit PADDING.
Accessing packet data via invalidated references
Some networking BPF helper functions such as
bpf_skb_store_bytes
might change the size of a packet data. As verifier is not able to track such changes, any a priori reference to the data will be invalidated by verifier. Therefore, the reference needs to be updated before accessing the data to avoid verifier rejecting a program.To illustrate this, consider the following snippet:
struct iphdr *ip4 = (struct iphdr *) skb->data + ETH_HLEN; skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0); if (ip4->protocol == IPPROTO_TCP) { // do something }Verifier will reject the snippet due to dereference of the invalidated
ip4->protocol
:R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=34,r=34,imm=0) R3=inv0 R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R8=inv4294967162 R9=pkt(id=0,off=0,r=34,imm=0) R10=fp0,call_-1 ... 18: (85) call bpf_skb_store_bytes#9 19: (7b) *(u64 *)(r10 -56) = r7 R0=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=2,var_off=(0x0; 0x3)) R8=inv4294967162 R9=inv(id=0) R10=fp0,call_-1 fp-48=mmmm???? fp-56=mmmmmmmm 21: (61) r1 = *(u32 *)(r9 +23) R9 invalid mem access 'inv'To fix this, the reference to
ip4
has to be updated:struct iphdr *ip4 = (struct iphdr *) skb->data + ETH_HLEN; skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0); ip4 = (struct iphdr *) skb->data + ETH_HLEN; if (ip4->protocol == IPPROTO_TCP) { // do something }
iproute2
There are various front ends for loading BPF programs into the kernel such as bcc,
perf, iproute2 and others. The Linux kernel source tree also provides a user space
library under tools/lib/bpf/
, which is mainly used and driven by perf for
loading BPF tracing programs into the kernel. However, the library itself is
generic and not limited to perf only. bcc is a toolkit providing many useful
BPF programs mainly for tracing that are loaded ad-hoc through a Python interface
embedding the BPF C code. Syntax and semantics for implementing BPF programs
slightly differ among front ends in general, though. Additionally, there are also
BPF samples in the kernel source tree (samples/bpf/
) which parse the generated
object files and load the code directly through the system call interface.
This and previous sections mainly focus on the iproute2 suite’s BPF front end for loading networking programs of XDP, tc or lwt type, since Cilium’s programs are implemented against this BPF loader. In future, Cilium will be equipped with a native BPF loader, but programs will still be compatible to be loaded through iproute2 suite in order to facilitate development and debugging.
All BPF program types supported by iproute2 share the same BPF loader logic
due to having a common loader back end implemented as a library (lib/bpf.c
in iproute2 source tree).
The previous section on LLVM also covered some iproute2 parts related to writing BPF C programs, and later sections in this document are related to tc and XDP specific aspects when writing programs. Therefore, this section will rather focus on usage examples for loading object files with iproute2 as well as some of the generic mechanics of the loader. It does not try to provide a complete coverage of all details, but enough for getting started.
1. Loading of XDP BPF object files.
Given a BPF object file
prog.o
has been compiled for XDP, it can be loaded throughip
to a XDP-supported netdevice calledem1
with the following command:# ip link set dev em1 xdp obj prog.oThe above command assumes that the program code resides in the default section which is called
prog
in XDP case. Should this not be the case, and the section is named differently, for example,foobar
, then the program needs to be loaded as:# ip link set dev em1 xdp obj prog.o sec foobarNote that it is also possible to load the program out of the
.text
section. Changing the minimal, stand-alone XDP drop program by removing the__section()
annotation from thexdp_drop
entry point would look like the following:#include <linux/bpf.h> #ifndef __section # define __section(NAME) \ __attribute__((section(NAME), used)) #endif int xdp_drop(struct xdp_md *ctx) { return XDP_DROP; } char __license[] __section("license") = "GPL";And can be loaded as follows:
# ip link set dev em1 xdp obj prog.o sec .textBy default,
ip
will throw an error in case a XDP program is already attached to the networking interface, to prevent it from being overridden by accident. In order to replace the currently running XDP program with a new one, the-force
option must be used:# ip -force link set dev em1 xdp obj prog.oMost XDP-enabled drivers today support an atomic replacement of the existing program with a new one without traffic interruption. There is always only a single program attached to an XDP-enabled driver due to performance reasons, hence a chain of programs is not supported. However, as described in the previous section, partitioning of programs can be performed through tail calls to achieve a similar use case when necessary.
The
ip link
command will display anxdp
flag if the interface has an XDP program attached.ip link | grep xdp
can thus be used to find all interfaces that have XDP running. Further introspection facilities are provided through the detailed view withip -d link
andbpftool
can be used to retrieve information about the attached program based on the BPF program ID shown in theip link
dump.In order to remove the existing XDP program from the interface, the following command must be issued:
# ip link set dev em1 xdp offIn the case of switching a driver’s operation mode from non-XDP to native XDP and vice versa, typically the driver needs to reconfigure its receive (and transmit) rings in order to ensure received packet are set up linearly within a single page for BPF to read and write into. However, once completed, then most drivers only need to perform an atomic replacement of the program itself when a BPF program is requested to be swapped.
In total, XDP supports three operation modes which iproute2 implements as well:
xdpdrv
,xdpoffload
andxdpgeneric
.
xdpdrv
stands for native XDP, meaning the BPF program is run directly in the driver’s receive path at the earliest possible point in software. This is the normal / conventional XDP mode and requires driver’s to implement XDP support, which all major 10G/40G/+ networking drivers in the upstream Linux kernel already provide.
xdpgeneric
stands for generic XDP and is intended as an experimental test bed for drivers which do not yet support native XDP. Given the generic XDP hook in the ingress path comes at a much later point in time when the packet already enters the stack’s main receive path as askb
, the performance is significantly less than with processing inxdpdrv
mode.xdpgeneric
therefore is for the most part only interesting for experimenting, less for production environments.Last but not least, the
xdpoffload
mode is implemented by SmartNICs such as those supported by Netronome’s nfp driver and allow for offloading the entire BPF/XDP program into hardware, thus the program is run on each packet reception directly on the card. This provides even higher performance than running in native XDP although not all BPF map types or BPF helper functions are available for use compared to native XDP. The BPF verifier will reject the program in such case and report to the user what is unsupported. Other than staying in the realm of supported BPF features and helper functions, no special precautions have to be taken when writing BPF C programs.When a command like
ip link set dev em1 xdp obj [...]
is used, then the kernel will attempt to load the program first as native XDP, and in case the driver does not support native XDP, it will automatically fall back to generic XDP. Thus, for example, using explicitlyxdpdrv
instead ofxdp
, the kernel will only attempt to load the program as native XDP and fail in case the driver does not support it, which provides a guarantee that generic XDP is avoided altogether.Example for enforcing a BPF/XDP program to be loaded in native XDP mode, dumping the link details and unloading the program again:
# ip -force link set dev em1 xdpdrv obj prog.o # ip link show [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 1 tag 57cd311f2e27366b [...] # ip link set dev em1 xdpdrv offSame example now for forcing generic XDP, even if the driver would support native XDP, and additionally dumping the BPF instructions of the attached dummy program through bpftool:
# ip -force link set dev em1 xdpgeneric obj prog.o # ip link show [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 4 tag 57cd311f2e27366b <-- BPF program ID 4 [...] # bpftool prog dump xlated id 4 <-- Dump of instructions running on em1 0: (b7) r0 = 1 1: (95) exit # ip link set dev em1 xdpgeneric offAnd last but not least offloaded XDP, where we additionally dump program information via bpftool for retrieving general metadata:
# ip -force link set dev em1 xdpoffload obj prog.o # ip link show [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 8 tag 57cd311f2e27366b [...] # bpftool prog show id 8 8: xdp tag 57cd311f2e27366b dev em1 <-- Also indicates a BPF program offloaded to em1 loaded_at Apr 11/20:38 uid 0 xlated 16B not jited memlock 4096B # ip link set dev em1 xdpoffload offNote that it is not possible to use
xdpdrv
andxdpgeneric
or other modes at the same time, meaning only one of the XDP operation modes must be picked.A switch between different XDP modes e.g. from generic to native or vice versa is not atomically possible. Only switching programs within a specific operation mode is:
# ip -force link set dev em1 xdpgeneric obj prog.o # ip -force link set dev em1 xdpoffload obj prog.o RTNETLINK answers: File exists # ip -force link set dev em1 xdpdrv obj prog.o RTNETLINK answers: File exists # ip -force link set dev em1 xdpgeneric obj prog.o <-- Succeeds due to xdpgeneric #Switching between modes requires to first leave the current operation mode in order to then enter the new one:
# ip -force link set dev em1 xdpgeneric obj prog.o # ip -force link set dev em1 xdpgeneric off # ip -force link set dev em1 xdpoffload obj prog.o # ip l [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 17 tag 57cd311f2e27366b [...] # ip -force link set dev em1 xdpoffload off
2. Loading of tc BPF object files.
Given a BPF object file
prog.o
has been compiled for tc, it can be loaded through the tc command to a netdevice. Unlike XDP, there is no driver dependency for supporting attaching BPF programs to the device. Here, the netdevice is calledem1
, and with the following command the program can be attached to the networkingingress
path ofem1
:# tc qdisc add dev em1 clsact # tc filter add dev em1 ingress bpf da obj prog.oThe first step is to set up a
clsact
qdisc (Linux queueing discipline).clsact
is a dummy qdisc similar to theingress
qdisc, which can only hold classifier and actions, but does not perform actual queueing. It is needed in order to attach thebpf
classifier. Theclsact
qdisc provides two special hooks calledingress
andegress
, where the classifier can be attached to. Bothingress
andegress
hooks are located in central receive and transmit locations in the networking data path, where every packet on the device passes through. Theingress
hook is called from__netif_receive_skb_core() -> sch_handle_ingress()
in the kernel and theegress
hook from__dev_queue_xmit() -> sch_handle_egress()
.The equivalent for attaching the program to the
egress
hook looks as follows:# tc filter add dev em1 egress bpf da obj prog.oThe
clsact
qdisc is processed lockless fromingress
andegress
direction and can also be attached to virtual, queue-less devices such asveth
devices connecting containers.Next to the hook, the
tc filter
command selectsbpf
to be used inda
(direct-action) mode.da
mode is recommended and should always be specified. It basically means that thebpf
classifier does not need to call into external tc action modules, which are not necessary forbpf
anyway, since all packet mangling, forwarding or other kind of actions can already be performed inside the single BPF program which is to be attached, and is therefore significantly faster.At this point, the program has been attached and is executed once packets traverse the device. Like in XDP, should the default section name not be used, then it can be specified during load, for example, in case of section
foobar
:# tc filter add dev em1 egress bpf da obj prog.o sec foobariproute2’s BPF loader allows for using the same command line syntax across program types, hence the
obj prog.o sec foobar
is the same syntax as with XDP mentioned earlier.The attached programs can be listed through the following commands:
# tc filter show dev em1 ingress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f # tc filter show dev em1 egress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714The output of
prog.o:[ingress]
tells that program sectioningress
was loaded from the fileprog.o
, andbpf
operates indirect-action
mode. The programid
andtag
is appended for each case, where the latter denotes a hash over the instruction stream which can be correlated with the object file orperf
reports with stack traces, etc. Last but not least, theid
represents the system-wide unique BPF program identifier that can be used along withbpftool
to further inspect or dump the attached BPF program.tc can attach more than just a single BPF program, it provides various other classifiers which can be chained together. However, attaching a single BPF program is fully sufficient since all packet operations can be contained in the program itself thanks to
da
(direct-action
) mode, meaning the BPF program itself will already return the tc action verdict such asTC_ACT_OK
,TC_ACT_SHOT
and others. For optimal performance and flexibility, this is the recommended usage.In the above
show
command, tc also displayspref 49152
andhandle 0x1
next to the BPF related output. Both are auto-generated in case they are not explicitly provided through the command line.pref
denotes a priority number, which means that in case multiple classifiers are attached, they will be executed based on ascending priority, andhandle
represents an identifier in case multiple instances of the same classifier have been loaded under the samepref
. Since in case of BPF, a single program is fully sufficient,pref
andhandle
can typically be ignored.Only in the case where it is planned to atomically replace the attached BPF programs, it would be recommended to explicitly specify
pref
andhandle
a priori on initial load, so that they do not have to be queried at a later point in time for thereplace
operation. Thus, creation becomes:# tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar # tc filter show dev em1 ingress filter protocol all pref 1 bpf filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396fAnd for the atomic replacement, the following can be issued for updating the existing program at
ingress
hook with the new BPF program from the fileprog.o
in sectionfoobar
:# tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobarLast but not least, in order to remove all attached programs from the
ingress
respectivelyegress
hook, the following can be used:# tc filter del dev em1 ingress # tc filter del dev em1 egressFor removing the entire
clsact
qdisc from the netdevice, which implicitly also removes all attached programs from theingress
andegress
hooks, the below command is provided:# tc qdisc del dev em1 clsacttc BPF programs can also be offloaded if the NIC and driver has support for it similarly as with XDP BPF programs. Netronome’s nfp supported NICs offer both types of BPF offload.
# tc qdisc add dev em1 clsact # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o Error: TC offload is disabled on net device. We have an error talking to the kernelIf the above error is shown, then tc hardware offload first needs to be enabled for the device through ethtool’s
hw-tc-offload
setting:# ethtool -K em1 hw-tc-offload on # tc qdisc add dev em1 clsact # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o # tc filter show dev em1 ingress filter protocol all pref 1 bpf filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366bThe
in_hw
flag confirms that the program has been offloaded to the NIC.Note that BPF offloads for both tc and XDP cannot be loaded at the same time, either the tc or XDP offload option must be selected.
3. Testing BPF offload interface via netdevsim driver.
The netdevsim driver which is part of the Linux kernel provides a dummy driver which implements offload interfaces for XDP BPF and tc BPF programs and facilitates testing kernel changes or low-level user space programs implementing a control plane directly against the kernel’s UAPI.
A netdevsim device can be created as follows:
# modprobe netdevsim // [ID] [PORT_COUNT] # echo "1 1" > /sys/bus/netdevsim/new_device # devlink dev netdevsim/netdevsim1 # devlink port netdevsim/netdevsim1/0: type eth netdev eth0 flavour physical # ip l [...] 4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ffAfter that step, XDP BPF or tc BPF programs can be test loaded as shown in the various examples earlier:
# ip -force link set dev eth0 xdpoffload obj prog.o # ip l [...] 4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 xdpoffload qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff prog/xdp id 16 tag a04f5eef06a7f555
These two workflows are the basic operations to load XDP BPF respectively tc BPF programs with iproute2.
There are other various advanced options for the BPF loader that apply both to XDP and tc, some of them are listed here. In the examples only XDP is presented for simplicity.
1. Verbose log output even on success.
The option
verb
can be appended for loading programs in order to dump the verifier log, even if no error occurred:# ip link set dev em1 xdp obj xdp-example.o verb Prog section 'prog' loaded (5)! - Type: 6 - Instructions: 2 (0 over limit) - License: GPL Verifier analysis: 0: (b7) r0 = 1 1: (95) exit processed 2 insns
2. Load program that is already pinned in BPF file system.
Instead of loading a program from an object file, iproute2 can also retrieve the program from the BPF file system in case some external entity pinned it there and attach it to the device:
# ip link set dev em1 xdp pinned /sys/fs/bpf/progiproute2 can also use the short form that is relative to the detected mount point of the BPF file system:
# ip link set dev em1 xdp pinned m:prog
When loading BPF programs, iproute2 will automatically detect the mounted
file system instance in order to perform pinning of nodes. In case no mounted
BPF file system instance was found, then tc will automatically mount it
to the default location under /sys/fs/bpf/
.
In case an instance has already been found, then it will be used and no additional mount will be performed:
# mkdir /var/run/bpf
# mount --bind /var/run/bpf /var/run/bpf
# mount -t bpf bpf /var/run/bpf
# tc filter add dev em1 ingress bpf da obj tc-example.o sec prog
# tree /var/run/bpf
/var/run/bpf
+-- ip -> /run/bpf/tc/
+-- tc
| +-- globals
| +-- jmp_map
+-- xdp -> /run/bpf/tc/
4 directories, 1 file
By default tc will create an initial directory structure as shown above,
where all subsystem users will point to the same location through symbolic
links for the globals
namespace, so that pinned BPF maps can be reused
among various BPF program types in iproute2. In case the file system instance
has already been mounted and an existing structure already exists, then tc will
not override it. This could be the case for separating lwt
, tc
and
xdp
maps in order to not share globals
among all.
As briefly covered in the previous LLVM section, iproute2 will install a header file upon installation which can be included through the standard include path by BPF programs:
#include <iproute2/bpf_elf.h>
The purpose of this header file is to provide an API for maps and default section names used by programs. It’s a stable contract between iproute2 and BPF programs.
The map definition for iproute2 is struct bpf_elf_map
. Its members have
been covered earlier in the LLVM section of this document.
When parsing the BPF object file, the iproute2 loader will walk through
all ELF sections. It initially fetches ancillary sections like maps
and
license
. For maps
, the struct bpf_elf_map
array will be checked
for validity and whenever needed, compatibility workarounds are performed.
Subsequently all maps are created with the user provided information, either
retrieved as a pinned object, or newly created and then pinned into the BPF
file system. Next the loader will handle all program sections that contain
ELF relocation entries for maps, meaning that BPF instructions loading
map file descriptors into registers are rewritten so that the corresponding
map file descriptors are encoded into the instructions immediate value, in
order for the kernel to be able to convert them later on into map kernel
pointers. After that all the programs themselves are created through the BPF
system call, and tail called maps, if present, updated with the program’s file
descriptors.