Datadog-agent: Cluster-Agent does not initialise on k3s.

Created on 4 Feb 2020 · 7 Comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

<not available>

Describe what happened:
Trying to deploy the cluster-agent on k3s results in a "connection refused" error. Sending a flare also crashes.

 agent flare <redacted>
Please enter your email:
<redacted>
Asking the Cluster Agent to build the flare archive.
The agent was unable to make a full flare: Post https://localhost:5005/flare: dial tcp 127.0.0.1:5005: connect: connection refused.
Initiating flare locally, some logs will be mising.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14a7146]

goroutine 1 [running]:
github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver.convertmetadataMapperBundleToAPI(0x0, 0x0)
/go/src/github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/apiserver.go:420 +0x76
github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver.GetMetadataMapBundleOnAllNodes(0xc0002018c0, 0x0, 0x0, 0xc000b93820)
/go/src/github.com/DataDog/datadog-agent/pkg/util/kubernetes/apiserver/apiserver.go:372 +0x159
github.com/DataDog/datadog-agent/pkg/flare.zipMetadataMap(0xc0001a2e70, 0x22, 0xc000809f10, 0xd, 0x0, 0x0)
/go/src/github.com/DataDog/datadog-agent/pkg/flare/archive_dca.go:180 +0x5a8
github.com/DataDog/datadog-agent/pkg/flare.createDCAArchive(0xc0001a2d80, 0x2a, 0x1d40a01, 0xc0006e7a08, 0x1d72b2c, 0x22, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/DataDog/datadog-agent/pkg/flare/archive_dca.go:111 +0x56e
github.com/DataDog/datadog-agent/pkg/flare.CreateDCAArchive(0x2066401, 0xc0005e8440, 0x1b, 0x1d72b2c, 0x22, 0x3d, 0x0, 0x0, 0x0)
/go/src/github.com/DataDog/datadog-agent/pkg/flare/archive_dca.go:37 +0x255
github.com/DataDog/datadog-agent/cmd/cluster-agent/app.requestFlare(0x7fffd2f89b3c, 0x6, 0x0, 0x0)
/go/src/github.com/DataDog/datadog-agent/cmd/cluster-agent/app/flare.go:105 +0x56c
github.com/DataDog/datadog-agent/cmd/cluster-agent/app.glob..func5(0x33c55e0, 0xc000561040, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/DataDog/datadog-agent/cmd/cluster-agent/app/flare.go:75 +0x2ca
github.com/DataDog/datadog-agent/vendor/github.com/spf13/cobra.(*Command).execute(0x33c55e0, 0xc000560ff0, 0x1, 0x1, 0x33c55e0, 0xc000560ff0)
/go/src/github.com/DataDog/datadog-agent/vendor/github.com/spf13/cobra/command.go:826 +0x465
github.com/DataDog/datadog-agent/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x33c5d60, 0x1e112d0, 0xc0005dbb00, 0xc)
/go/src/github.com/DataDog/datadog-agent/vendor/github.com/spf13/cobra/command.go:914 +0x2fc
github.com/DataDog/datadog-agent/vendor/github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/DataDog/datadog-agent/vendor/github.com/spf13/cobra/command.go:864
main.main()
/go/src/github.com/DataDog/datadog-agent/cmd/cluster-agent/main.go:36 +0x189
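
In the trace above, the first argument to convertmetadataMapperBundleToAPI is printed as 0x0, which is consistent with a nil pointer being passed in and dereferenced. Below is a minimal sketch of the kind of nil guard that avoids this class of panic. The types and the function name `convertBundle` are hypothetical stand-ins, not the actual datadog-agent code, and this is not claimed to be the fix that was eventually shipped.

```go
package main

import "fmt"

// metadataBundle is a hypothetical stand-in for the agent's internal bundle type.
type metadataBundle struct {
	Services map[string][]string // pod name -> service names
}

// apiBundle is a hypothetical stand-in for the API response type.
type apiBundle struct {
	Services map[string][]string
}

// convertBundle copies a bundle into its API form.
// A nil input yields an empty result instead of a nil pointer dereference.
func convertBundle(in *metadataBundle) *apiBundle {
	out := &apiBundle{Services: make(map[string][]string)}
	if in == nil {
		return out
	}
	for pod, svcs := range in.Services {
		// Copy the slice so the result does not alias the input.
		out.Services[pod] = append([]string(nil), svcs...)
	}
	return out
}

func main() {
	// Nil input: no panic, empty result.
	fmt.Println(len(convertBundle(nil).Services))

	b := &metadataBundle{Services: map[string][]string{"pod-a": {"svc-1"}}}
	fmt.Println(convertBundle(b).Services["pod-a"])
}
```

Whether an empty result or an explicit error is the better response to a nil bundle depends on the caller. For a diagnostic path such as a flare, degrading gracefully seems preferable to crashing.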

Describe what you expected:

That I can get the cluster agent status and send a flare.

Steps to reproduce the issue:

Bring up a single-node k3s cluster (k3s 0.9.1 in this case) on Ubuntu 18.04. This is a Vagrant machine.

Try to bring up the datadog cluster agent and get status or send a flare.

Additional environment details (Operating System, Cloud provider, etc):

As above.

kind/bug team/containers

All 7 comments

Hi @dmarkey - Thanks for bringing this up.
I am looking into it now.

Best,
.C

Just tried on an EKS cluster, so this is not k3s-specific.

Interesting, I tried on GKE and was able to get a flare through.
Is there any way you could reach out to our solutions team so we can gather logs and other relevant details that would help debug this?
.C

I have had a case open for a few days: No. 301546.

Also, FYI, 1.4 seems to get much further.

Hi @dmarkey
We released datadog/cluster-agent:1.5.1 a few days ago. Did you get a chance to test it, and does it solve your issue?

To close the loop, with 1.5.0 this is easily reproducible:

➜  dev k3sctl exec -ti datadog-cluster-agent-69c6846ddd-9s92l bash
root@datadog-cluster-agent-69c6846ddd-9s92l:/# agent status
Getting the status from the agent.

        Could not reach agent: Get https://localhost:5005/status: dial tcp [::1]:5005: connect: connection refused
        Make sure the agent is running before requesting the status.
        Contact support if you continue having issues.
Error: Get https://localhost:5005/status: dial tcp [::1]:5005: connect: connection refused
Usage:

The fix released in 1.5.2 does work:

➜  dev k3sctl exec -ti datadog-cluster-agent-5879455fd6-fvxft bash
root@datadog-cluster-agent-5879455fd6-fvxft:/# agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.5.2)
==============================

  Status date: 2020-02-20 22:41:52.732369 UTC
[...]

Could you give it a try, @dmarkey?
