Cilium: "'stddef.h' file not found" in dev. VM

Created on 30 Apr 2020 · 27 Comments · Source: cilium/cilium

Issue

Trying to run the unit tests from inside the dev. VM results in the following error:

In file included from unit-test.c:8:
/usr/include/stdlib.h:31:10: fatal error: 'stddef.h' file not found
#include <stddef.h>

That error is happening because we recently switched the compiler for the BPF unit tests to Clang. The dev. VM images (4.9 and net-next) don't include a full Clang, but only the strict minimum required to compile BPF programs.
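For context, stddef.h is one of the headers that the compiler ships itself rather than taking from libc, which is consistent with a trimmed-down Clang install failing on it. A minimal sketch of a host-compiled smoke test follows; the struct and function names are hypothetical and not taken from the Cilium tree:

```c
#include <stddef.h> /* provided by the compiler, not by libc */

/* Hypothetical type, used only to exercise offsetof() and size_t. */
struct pair {
	char tag;
	int value;
};

/* Returns the byte offset of 'value' inside struct pair. */
size_t value_offset(void)
{
	return offsetof(struct pair, value);
}
```

If this file compiles for the host (no -target bpf, no -nostdinc), the Clang install can probably build unit-test.c as well.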

Workaround

The following steps, proposed by André for a similar issue on v1.7, work as a quick fix:

sudo mv /usr/bin/clang{,.bak}
sudo mv /usr/bin/llc{,.bak}
sudo apt-get install -y clang-7 llvm-7
sudo update-alternatives --install /usr/bin/clang clang /usr/lib/llvm-7/bin/clang 1000
sudo update-alternatives --install /usr/bin/llc llc /usr/lib/llvm-7/bin/llc 1000

To revert:

sudo mv /usr/bin/clang{.bak,}
sudo mv /usr/bin/llc{.bak,}

Proper Fix

The proper fix requires reverting part of cilium/packer-ci-build#200, so that Clang in the dev. VM can compile for x86.


Reported-by: Nate Sweet nathanjsweet@pm.me

kind/bug

All 27 comments

I get segmentation fault on bpf unit test on master with the above workaround on fresh dev VM:

make[2]: Entering directory '/home/vagrant/go/src/github.com/cilium/cilium/test/bpf'
clang -Wall -Wextra -Werror -Wshadow -Wno-unused-parameter -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -I../../bpf/ -I../../bpf/include -I. -D__NR_CPUS__=2 -O2 -target bpf -std=gnu89 -nostdinc -emit-llvm -c elf-demo.c -o - | llc -march=bpf -mcpu=probe -filetype=obj -o elf-demo.o
clang -Wall -Wextra -Werror -Wshadow -Wno-unused-parameter -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -I../../bpf/ -I../../bpf/include -I. -D__NR_CPUS__=2 -O2 -I../../bpf/ unit-test.c -o unit-test
make[2]: Leaving directory '/home/vagrant/go/src/github.com/cilium/cilium/test/bpf'
test/bpf/unit-test
Makefile:216: recipe for target 'unit-tests' failed
make[1]: *** [unit-tests] Segmentation fault
make[1]: Leaving directory '/home/vagrant/go/src/github.com/cilium/cilium'
Makefile:201: recipe for target 'tests' failed
make: *** [tests] Error 2
(gdb) run
Starting program: /home/vagrant/go/src/github.com/cilium/cilium/test/bpf/unit-test 

Program received signal SIGSEGV, Segmentation fault.
0x0000000000470698 in memcmp (x=<error reading variable: Cannot access memory at address 0x7fffff7fefc8>, y=<error reading variable: Cannot access memory at address 0x7fffff7fefc0>, 
    len=<error reading variable: Cannot access memory at address 0x7fffff7fefb8>) at ../../bpf/include/linux/../bpf/builtins.h:322
322 {

Some sort of memcmp loop:

#0  0x0000000000470698 in memcmp (x=<error reading variable: Cannot access memory at address 0x7fffff7fefc8>, y=<error reading variable: Cannot access memory at address 0x7fffff7fefc0>,
    len=<error reading variable: Cannot access memory at address 0x7fffff7fefb8>) at ../../bpf/include/linux/../bpf/builtins.h:322
#1  0x00000000004706e5 in __bpf_memcmp_builtin (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:252
#2  __bpf_memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:301
#3  memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:323
#4  0x00000000004706e5 in __bpf_memcmp_builtin (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:252
#5  __bpf_memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:301
#6  memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:323
#7  0x00000000004706e5 in __bpf_memcmp_builtin (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:252
#8  __bpf_memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:301
#9  memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:323
#10 0x00000000004706e5 in __bpf_memcmp_builtin (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:252
#11 __bpf_memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:301
#12 memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:323
#13 0x00000000004706e5 in __bpf_memcmp_builtin (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:252
#14 __bpf_memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:301
#15 memcmp (x=0x7fffffffd268, y=0x7fffffffd260, len=1) at ../../bpf/include/linux/../bpf/builtins.h:323

Line 252 is the #else of this (pasted from the gdb source view): 258 #if __clang_major__ >= 10

@borkmann Any way to get the bpf unit test to pass on the dev VM?

Hm, weird. I'm not using the dev VM; what is different in this environment from, say, Travis CI, where it seems to pass? Would upgrading to clang-10 fix it? Though I'm a bit puzzled why SIGSEGV is hit.

Maybe it runs out of stack? The infinite loop/recursion between the three memcmp functions is conditional on __clang_major__ being less than 10, so I'd think updating clang would help.

@borkmann ^^^

Ok, what happens if you change #if __clang_major__ >= 10 to #if 0 and run the builtin memcmp also with clang-10? Still crashing or not?

@borkmann Can't run clang 10 or 11 on the dev VM, see description of this issue. I was applying the workaround provided by @aanm above to downgrade to clang-7.

So a precompiled clang-10 version like the one we pull in Travis [0] is not an option here? Alternatively, would it help if the packer-ci boxes [1] were also built with the x86 backend? Given that cilium-runtime is different from packer-ci, we could enable both backends there.

[0] https://github.com/cilium/cilium/blob/master/.travis/prepare.sh
[1] https://github.com/cilium/packer-ci-build/blob/master/provision/ubuntu/install.sh

I don't see why using .travis/prepare.sh would not work. Can't test that right now though.

Enabling x86 backend in packer-ci would be better, though, less downloading at dev VM start time, if the vagrant box is already available.

@borkmann Maybe do both, so that [0] would bridge us over whenever [1] is out of date?

Using the precompiled clang-10 version "fixes" the segfault. Not sure what is happening here. I'll try to send the PR to update the VM image today.

Even if you do the #if 0 ...

Ok, what happens if you change #if __clang_major__ >= 10 to #if 0 and run the builtin memcmp also with clang-10? Still crashing or not?

... with clang-10?

Ok, what happens if you change #if __clang_major__ >= 10 to #if 0 and run the builtin memcmp also with clang-10? Still crashing or not?

Still crashing, even outside the dev. VM. Like Jarno said, there seems to be a memcmp loop, but I'm not sure why __bpf_memcmp_builtin() ends up calling memcmp()... Other builtins are fine.

Afaik, for __builtin_memcmp() the compiler could still decide to make a call to glibc's memcmp(), for example. But I'm puzzled why it would segfault. Does the same happen when using gcc for x86 compilation?

Afaik, for __builtin_memcmp() the compiler could still decide to make a call to glibc's memcmp(), for example.

Here it's calling our memcmp() implementation, hence the loop.

Got it, just reproduced locally, will look into it.

So renaming our internal memcmp() to memcmp2() fixes the issue. It looks like Clang decides not to inline __builtin_memcmp() and instead emits a call to an available memcmp(). It picks the one we define, so we end up in the loop you mentioned. Interesting. :)
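The loop and the rename described here can be sketched roughly as follows. This is an illustration under assumptions, not the code from bpf/builtins.h: the problematic shape is shown only in a comment, and memcmp2() is the name suggested in this thread.

```c
#include <stddef.h>

/*
 * Problematic shape (not compiled here):
 *
 *   int memcmp(const void *x, const void *y, size_t n)
 *   {
 *           return __builtin_memcmp(x, y, n);
 *   }
 *
 * If the compiler lowers the builtin to a library call instead of inlining
 * it, that call resolves to the function above, which calls itself until
 * the stack is exhausted.
 */

/*
 * Same body under a name the compiler does not treat specially. A lowered
 * builtin now resolves to libc's memcmp() rather than back to this wrapper.
 */
int memcmp2(const void *x, const void *y, size_t n)
{
	return __builtin_memcmp(x, y, n);
}
```

Callers would use memcmp2(a, b, len) wherever the local memcmp() was used before. This assumes a hosted x86 build, where libc provides memcmp().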

Then I'm guessing the only reason it doesn't break on other builtins is because Clang does inline them. The rules for that seem to be a bit different between __builtin_memcmp() and others. I tried switching from void to char because of the following doc. statement but it didn't help:

Constant evaluation support for the __builtin_mem* functions is provided only for arrays of char, signed char, unsigned char, or char8_t, despite these functions accepting an argument of type const void*.

Yeah, tried that as well earlier. I just switched to __builtin_bcmp() in the PR.
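For readers unfamiliar with __builtin_bcmp(): it only reports whether two buffers differ, not how they are ordered. A minimal sketch of an equality helper built on it follows; bytes_equal() is a hypothetical name, and how the PR actually wires the builtin in is not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical wrapper, not the code from the PR: reports only equality. */
int bytes_equal(const void *x, const void *y, size_t n)
{
	/* bcmp-style result: zero means equal, any nonzero value means different. */
	return __builtin_bcmp(x, y, n) == 0;
}
```

Because the result carries no ordering, this only fits call sites that test for equality, for example comparing two keys or addresses.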

https://github.com/cilium/packer-ci-build/pull/218 fixes the issue in our VM images but we haven't updated vagrant_box_defaults.rb to use the new images yet. Planning to do that today.

I am still able to reproduce the issue :disappointed:

I'm also hitting this issue in the dev VM. @pchaigno Are you planning to check in your fix?

The fix for this is currently blocked by https://github.com/cilium/packer-ci-build/issues/230. I'll try to update the 4.9 and 4.19 VM images to unblock at least these.
