Calico: calico-node crashes when creating the XDP failsafe ports map

Created on 3 Oct 2019  ·  12 Comments  ·  Source: projectcalico/calico

Current Behavior

kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-7f68dfc8c6-qzb4s   1/1     Running   0          38m
calico-node-4nvhv                          1/1     Running   0          16m
calico-node-dl8ml                          1/1     Running   0          16m
calico-node-h9mrl                          1/1     Running   0          15m
calico-node-khx9d                          1/1     Running   0          15m
calico-node-msvqd                          0/1     Running   0          15m
2019-10-03 04:34:38.163 [WARNING][16449] int_dataplane.go 728: failed to set XDP failsafe ports, disabling XDP: failed to create map (calico_failsafe_ports_v1): exit status 255
Error: map create failed: Operation not permitted

2019-10-03 04:34:38.296 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-824808941): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-824808941'
Error: failed to load object file
 try=0
2019-10-03 04:34:38.356 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-424485992): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-424485992'
Error: failed to load object file
 try=1
2019-10-03 04:34:38.420 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-403440295): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-403440295'
Error: failed to load object file
 try=2
2019-10-03 04:34:38.472 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-154340314): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-154340314'
Error: failed to load object file
 try=3
2019-10-03 04:34:38.529 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-230288753): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-230288753'
Error: failed to load object file
 try=4
2019-10-03 04:34:38.612 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-864290844): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-864290844'
Error: failed to load object file
 try=5
2019-10-03 04:34:38.672 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-075479755): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-075479755'
Error: failed to load object file
 try=6
2019-10-03 04:34:38.732 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-109286318): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-109286318'
Error: failed to load object file
 try=7
2019-10-03 04:34:38.788 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-777518389): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-777518389'
Error: failed to load object file
 try=8
2019-10-03 04:34:38.836 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-444518672): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-444518672'
Error: failed to load object file
 try=9
2019-10-03 04:34:38.836 [PANIC][16449] int_dataplane.go 784: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x19b44a0,0xc0004dc230)

goroutine 1 [running]:
github.com/sirupsen/logrus.Entry.log(0xc00011c050, 0xc000588db0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f2d00000000, ...)
    /go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:112 +0x2d2
github.com/sirupsen/logrus.(*Entry).Panic(0xc0004dc0a0, 0xc000468290, 0x1, 0x1)
    /go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:182 +0x103
github.com/sirupsen/logrus.(*Entry).Panicf(0xc0004dc0a0, 0x1a336bb, 0x2b, 0xc000468340, 0x1, 0x1)
    /go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:230 +0xd4
github.com/sirupsen/logrus.(*Logger).Panicf(0xc00011c050, 0x1a336bb, 0x2b, 0xc000468340, 0x1, 0x1)
    /go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/logger.go:173 +0x86
github.com/sirupsen/logrus.Panicf(...)
    /go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/exported.go:145
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).shutdownXDPCompletely(0xc00040a900)
    /go/pkg/mod/github.com/projectcalico/[email protected]/dataplane/linux/int_dataplane.go:784 +0x2cd
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).doStaticDataplaneConfig(0xc00040a900)
    /go/pkg/mod/github.com/projectcalico/[email protected]/dataplane/linux/int_dataplane.go:729 +0xbaa
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).Start(0xc00040a900)
    /go/pkg/mod/github.com/projectcalico/[email protected]/dataplane/linux/int_dataplane.go:592 +0x2f
github.com/projectcalico/felix/dataplane.StartDataplaneDriver(0xc0004c7900, 0xc00001e9c0, 0xc000681620, 0x1, 0xc0004697d8, 0x0)
    /go/pkg/mod/github.com/projectcalico/[email protected]/dataplane/driver.go:186 +0xf09
github.com/projectcalico/felix/daemon.Run(0x1a05d48, 0x15, 0x1cbbc38, 0x6, 0x1d106e0, 0x28, 0x1ce87a0, 0x18)
    /go/pkg/mod/github.com/projectcalico/[email protected]/daemon/daemon.go:305 +0x1759
main.main()
    /go/src/github.com/projectcalico/node/cmd/calico-node/main.go:100 +0x405

Context

I set up Kubernetes 1.15.3 on a freshly installed cluster using Kubespray.
All nodes run Ubuntu 18.04.3 with kernel 5.0.0-29.
The Calico version is 3.7.3; I also tried 3.9.1 and got the same error.

I searched the code and found that the failing command is
bpftool map create /sys/fs/bpf/calico/calico_failsafe_ports_v1 type hash key 4 value 1 entries 65535 name calico_failsafe_ports_v1 flags 1
It fails on only one node, and it still failed after I reinstalled Docker.

When does this command return "Operation not permitted"?

Your Environment

  • Calico version: 3.7.3/3.9.1
  • Kubernetes version: 1.15.3
  • Operating System and version: Ubuntu 18.04.3 x64 kernel 5.0.0-29
kind/bug

All 12 comments

2019-10-03 04:34:38.836 [PANIC][16449] int_dataplane.go 784: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x19b44a0,0xc0004dc230)

is similar to #2901

Does kubespray run calico as non-root? Maybe we need a new permission. As a workaround you should be able to disable the XDP feature.

@fasaxc it runs as root.
And only one node gets this error, so I want to know why 😰

Anything obviously different about the bad node? Is it configured as the master? Is it running any particular services?

Some things that might help to find out what's special about that node:

sudo sysctl -a | grep bpf

and

kubectl exec -n kube-system <calico-node-pod-name> mount | grep bpf

I'm wondering if that node had trouble mounting the BPF file system or if the BPF sysctls are disabling BPF calls.

sudo sysctl -a | grep bpf
kernel.unprivileged_bpf_disabled = 0
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 0
net.core.bpf_jit_limit = 264241152
kubectl exec -n kube-system calico-node-msvqd mount | grep bpf
/sys/fs/bpf on /sys/fs/bpf type bpf (rw,relatime)

It all seems fine...

The bad node is not configured as the master

Are you running any policing/enforcing apps on there? For example selinux or something that monitors syscalls?

For anyone seeing this, a workaround should be to disable the XDP feature by setting the FELIX_XDPENABLED=false env var in the calico manifest. We're not sure what's causing the permission errors on some nodes but not others; it would be good to connect on the Calico Users Slack to investigate.

So, after reading more about this:

  • XDP is a performance optimization
  • It's not working on one of the nodes of the cluster, maybe because of SELinux rules, the NIC, or something else.

But I think this issue is still actionable... because there is still the fact that:

  • shutdownXDPCompletely fails in a panic and
  • the 10 retries all happen within about half a second, roughly 60 ms apart according to the log timestamps.

... After looking more at the code: (1) there is no backoff between the retries for the XDP shutdown, and (2) the tryResync semantics aren't very clear. I think I have a fix for these, and I'll file a PR shortly. That should make this logic more deterministic in failure scenarios and less spammy. I also wonder whether we should consider (3) not panicking and just moving on, since we already know that XDP isn't working to begin with?

@fasaxc @jayunit100 can we close this now that https://github.com/projectcalico/felix/pull/2165 has gone in?

For anyone else who comes across this, I found my issue was a combination of using Ubuntu, kernel 5.3, and having Secure Boot enabled. Some newer kernels enable lockdown mode, which breaks BPF. You can read more at this comment and this bug report: Disabling bpf() syscall on kernel lockdown break apps when secure boot is on
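
To check whether a node is in lockdown mode, one option is to read the kernel's lockdown file. The Go sketch below assumes the kernel exposes /sys/kernel/security/lockdown in the mainline format, where the active mode is shown in square brackets (for example "none [integrity] confidentiality"). Distribution kernels of that era may differ, so treat a missing file as inconclusive. activeLockdownMode is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// activeLockdownMode (hypothetical helper) returns the bracketed entry from
// the contents of the lockdown file, e.g. "integrity" for
// "none [integrity] confidentiality". It returns "" if nothing is bracketed.
func activeLockdownMode(contents string) string {
	for _, field := range strings.Fields(contents) {
		if strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			return strings.Trim(field, "[]")
		}
	}
	return ""
}

func main() {
	// Assumed path; it is only present when the lockdown LSM is built in.
	const path = "/sys/kernel/security/lockdown"

	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Printf("could not read %s (inconclusive): %v\n", path, err)
		return
	}

	mode := activeLockdownMode(string(data))
	switch mode {
	case "":
		fmt.Println("could not determine the lockdown mode")
	case "none":
		fmt.Println("lockdown is not active")
	default:
		// Per the comment above, an active lockdown can break BPF.
		fmt.Println("lockdown is active, mode:", mode)
	}
}
```

Run it on the affected node itself, not inside a container that hides /sys/kernel/security.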

Thanks for the comment @mcmcghee, I'm hitting exactly this.

That's it! @mcmcghee Thank you!
