Datadog-agent: system-probe container crash: "failed to create system probe" (file exists)

Created on 5 May 2020 · 8 Comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

Not available: the system-probe container is in a crash loop

Describe what happened:

  • Updated the datadog-agent DaemonSet definition (Kubernetes) to version 7.19.0.
  • Some system-probe containers, seemingly at random, went into a crash loop with the logs below:
2020-05-05 12:34:59 UTC | SYS-PROBE | INFO | (cmd/system-probe/main_common.go:80 in runAgent) | running system-probe with version: Version: 7.19.0, Git hash: 914b7646d, Git branch: HEAD, Build date: 2020-04-29T18:16:52, Go Version: 1.13.8,
2020-05-05 12:35:02 UTC | SYS-PROBE | INFO | (pkg/ebpf/common.go:68 in IsTracerSupportedByOS) | running on platform: linux-5.4.20-12.75.amzn2.x86_64-x86_64-with-glibc2.2.5
2020-05-05 12:35:04 UTC | SYS-PROBE | INFO | (cmd/system-probe/probe.go:46 in CreateSystemProbe) | Creating tracer for: system-probe
2020-05-05 12:35:05 UTC | SYS-PROBE | CRITICAL | (cmd/system-probe/main_common.go:94 in runAgent) | failed to create system probe: could not enable kprobe(kprobe/tcp_get_info) used for offset guessing: cannot write "p:ptcp_get_info tcp_get_info\n" to kprobe_events: write /sys/kernel/debug/tracing/kprobe_events: file exists
  • The probes do exist on the host system; they were created by containers from previous pods:
# ls -ld /sys/kernel/debug/tracing/events/kprobes/ptcp_get_info/
drwxr-xr-x 2 root root 0 Apr  2 15:15 /sys/kernel/debug/tracing/events/kprobes/ptcp_get_info/

Describe what you expected:

  • The system-probe container should clean up its kprobes gracefully after receiving SIGTERM from the kubelet. A graceful shutdown produces logs similar to:
2020-05-05 12:10:34 UTC | SYS-PROBE | CRITICAL | (pkg/process/util/signal_nowindows.go:21 in HandleSignals) | Caught signal 'terminated'; terminating.
2020-05-05 12:10:34 UTC | SYS-PROBE | DEBUG | (pkg/process/net/uds.go:73 in Stop) | uds: error removing socket file: remove /opt/datadog-agent/run/sysprobe.sock: no such file or directory

Steps to reproduce the issue:
The issue occurs intermittently in our CI pipeline. I was able to reproduce it manually by running kubectl delete pod several times in quick succession, especially with the --force flag. The container panics on receiving a second SIGTERM and exits without proper cleanup. Logs generated:

2020-05-05 12:27:30 UTC | SYS-PROBE | CRITICAL | (pkg/process/util/signal_nowindows.go:21 in HandleSignals) | Caught signal 'terminated'; terminating.
2020-05-05 12:27:30 UTC | SYS-PROBE | DEBUG | (pkg/process/net/uds.go:73 in Stop) | uds: error removing socket file: remove /opt/datadog-agent/run/sysprobe.sock: no such file or directory
2020-05-05 12:27:30 UTC | SYS-PROBE | CRITICAL | (pkg/process/util/signal_nowindows.go:21 in HandleSignals) | Caught signal 'terminated'; terminating.
panic: close of closed channel

goroutine 58 [running]:
github.com/DataDog/datadog-agent/pkg/process/util.HandleSignals(0xc000090240)
    /go/src/github.com/DataDog/datadog-agent/pkg/process/util/signal_nowindows.go:22 +0x22c
created by main.runAgent
    /go/src/github.com/DataDog/datadog-agent/cmd/system-probe/main_common.go:104 +0x430
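
The trace shows a channel being closed a second time when the second SIGTERM arrives. The agent's actual HandleSignals code is not shown in this report, so the following is only a sketch of one common way to make such a handler safe: guarding the close with sync.Once. All names here (exitNotifier, notify, handleSignals) are hypothetical, and this is not necessarily how the agent fixed it.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// exitNotifier (hypothetical) closes done exactly once, no matter how many
// termination signals arrive.
type exitNotifier struct {
	once sync.Once
	done chan struct{}
}

func newExitNotifier() *exitNotifier {
	return &exitNotifier{done: make(chan struct{})}
}

// notify is safe to call repeatedly; only the first call closes the channel.
func (n *exitNotifier) notify() {
	n.once.Do(func() { close(n.done) })
}

// handleSignals requests shutdown for every signal received. A second
// SIGTERM becomes a no-op instead of a "close of closed channel" panic.
func handleSignals(n *exitNotifier, sigs <-chan os.Signal) {
	for sig := range sigs {
		fmt.Printf("caught signal %q; terminating\n", sig)
		n.notify()
	}
}

func main() {
	n := newExitNotifier()
	sigs := make(chan os.Signal, 2)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	// Simulate two SIGTERMs arriving in quick succession, as happens with
	// repeated "kubectl delete pod --force" calls.
	sigs <- syscall.SIGTERM
	sigs <- syscall.SIGTERM

	go handleSignals(n, sigs)
	<-n.done
	fmt.Println("cleanup (for example, kprobe removal) would run here")
}
```

With this shape, the cleanup path after <-n.done still runs once even if the kubelet delivers several signals.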

Additional environment details (Operating System, Cloud provider, etc):

team/networks

All 8 comments

I have found out that #5200 is going to address that.

Hi! I'm closing this as it's been addressed by #5200 and released as part of 7.20.0. Thanks for the report!

Hi, we are still hitting this frequently. We are using 7.20.1 and deploying with Helm through Spinnaker. Would it be possible to clean up /sys/kernel/debug/tracing/kprobe_events on startup?

Same issue still occurring in 7.21.1

2020-08-04 13:56:07 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:395 in func1) | enabling process-agent for connections check as the system-probe is enabled
2020-08-04 13:56:07 UTC | SYS-PROBE | INFO | (cmd/system-probe/main_common.go:82 in runAgent) | running system-probe with version: Version: 7.21.1, Git hash: 83bdc57c7, Git branch: HEAD, Build date: 2020-07-21T17:13:11, Go Version: 1.13.11, 
2020-08-04 13:56:09 UTC | SYS-PROBE | INFO | (pkg/ebpf/common.go:51 in IsTracerSupportedByOS) | running on platform: linux-5.4.50-25.83.amzn2.x86_64-x86_64-with-glibc2.2.5
2020-08-04 13:56:10 UTC | SYS-PROBE | INFO | (cmd/system-probe/probe.go:50 in CreateSystemProbe) | Creating tracer for: system-probe
2020-08-04 13:56:10 UTC | SYS-PROBE | CRITICAL | (cmd/system-probe/main_common.go:96 in runAgent) | failed to create system probe: could not enable kprobe(kprobe/tcp_get_info) used for offset guessing: cannot write "p:ptcp_get_info tcp_get_info\n" to kprobe_events: write /sys/kernel/debug/tracing/kprobe_events: file exists

Similar issue on 7.23.1

2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:460 in func1) | enabling process-agent for connections check as the system-probe is enabled
2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:460 in func1) | network_config found, enabled = true
2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (cmd/system-probe/main.go:88 in runAgent) | running system-probe with version: Version: 7.23.1, Git hash: 8099db17e, Git branch: HEAD, Build date: 2020-10-20T22:24:33, Go Version: 1.14.7,
2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (pkg/ebpf/utils_linux.go:84 in IsTracerSupportedByOS) | running on platform: linux-4.19.112+-x86_64-with-glibc2.2.5
2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (cmd/system-probe/modules/network_tracer.go:43 in func1) | Creating tracer for: system-probe
2020-11-04 17:47:44 UTC | SYS-PROBE | ERROR | (cmd/system-probe/loader.go:39 in Register) | new module `network_tracer` error: error guessing offsets: could not start offset ebpf manager: couldn't start probe kprobe/tcp_getsockopt: couldn't enable kprobe kprobe/tcp_getsockopt: cannot open kprobe_events: open /sys/kernel/debug/tracing/kprobe_events: permission denied
2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (cmd/system-probe/modules/tcp_queue_tracer.go:19 in func4) | TCP queue length tracer disabled
2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (cmd/system-probe/modules/oom_kill_probe.go:19 in func2) | OOM kill probe disabled
2020-11-04 17:47:44 UTC | SYS-PROBE | INFO | (pkg/security/module/module.go:181 in NewModule) | security runtime module disabled
2020-11-04 17:47:44 UTC | SYS-PROBE | CRITICAL | (cmd/system-probe/main.go:122 in runAgent) | failed to create system probe: no module could be loaded

The above issue is not related. Your error is "permission denied", and the cause is different.
This issue focuses on the "file exists" error.

This issue should be resolved. Please re-open if you encounter it again.

@icelynjennings Your problem is indeed different. system-probe should be running as root, so it shouldn't have permission problems. Please double-check how you have the agent/system-probe set up. If you still have problems, please contact support.
