Datadog-agent: question: how to send correct payloads to trace-agent

Created on 26 Mar 2019  路  15Comments  路  Source: DataDog/datadog-agent

This is a follow on to #3172

I am now sending msgpack data to workaround the issue discussed there.

This is my payload, rendered as JSON for convenience:

[[{"service":"tester","start":1553612716261592697,"trace_id":16165123543596688256,"name":"span","resource":"time","duration":800971,"span_id":6701823613192035672}]]

I receive back a 200.

But closer inspection in the logs, which are using level trace, (I'm using datadog/agent:6.10.3-logs-api-key-rc.1 with a valid DD_API_KEY) I see

[ TRACE ] 2019-03-26 15:05:03 INFO (api.go:341) - [] -> traces received: 1, traces dropped: 1, traces filtered: 0, traces amount: 102 bytes, services received: 0, services amount: 0 bytes, events extracted: 0, events sampled: 0

with no reason given.

The question is: why is my trace dropped?

kinquestion teaagent-apm

Most helpful comment

Hi Sam!

From your error message it seems like you're sending the wrong media type in the Content-Type header. Valid media types are:

  • application/json
  • text/json
  • application/msgpack

But maybe I've misread the response message.

Additionally, if you take a look at our span proto definition, you'll notice that the "metrics" map takes values double (float64), not strings, that's why you get the error.

This is a datadog-agent bug: you should set the correct content-type for the payload you're responding with.

You're right. I will fix this.

Also I hope this exercise shows why it would be very useful if the datadog agent returned error codes over HTTP, because we have absolutely no way to integration test our client when it returns 200s and silently drops traces (possibly logging them)

I completely understand your frustration and share the sentiment. Unfortunately the performance benefit we get from this behavior is much more important to us now. You've already figured out yourself what the criteria for valid traces are so you should be good to go.

I'd recommend experimenting locally while you develop the tracer. I'm also here and happy to answer all your questions (that I have an answer to).

As a side note, I developed an integration testing framework which (as part of other things) runs a trace-agent process, sends payloads to it and inspects the logs. So it's definitely doable, even if not as easy as you'd like. It is located here: https://godoc.org/github.com/DataDog/datadog-agent/pkg/trace/test. This is Go however.

All 15 comments

And now I'm seeing three traces in the datadog UI from other similar runs. Only difference is random number used for the span/traceid and start time.

Also, I occassionally see

[ TRACE ] 2019-03-26 15:33:48 INFO (api.go:341) - no data received

despite only sending the same payload.

What is going on? Why is the agent behaving non-deterministically?

In addition, if I attach the _sampling_priority_v1 as discussed in #3031 with @gbbr I get an error from the agent

[[{"metrics":{"_sampling_priority_v1":"2"},"service":"tester","start":1553615499238013807,"trace_id":6684673633581579890,"name":"span","resource":"time","duration":800971,"span_id":7008417499414717876}]]
Left (FailureResponse (Response {responseStatusCode = Status {statusCode = 400, statusMessage = "Bad Request"}, responseHeaders = fromList [("Content-Type","text/plain; charset=utf-8"),("X-Content-Type-Options","nosniff"),("Date","Tue, 26 Mar 2019 15:51:39 GMT"),("Content-Length","63")], responseHttpVersion = HTTP/1.1, responseBody = "msgp: attempted to decode type \"str\" with method for \"float64\"\n"}))

I tried both the 0.3 and 0.4 endpoints.

I'm guessing this is because the stringy map that isn't defined in the schema is expecting a number and not a string. This is a modelling bug, please add this field to your schema.

When I workaround this on my side, I get a decoding problem because the response type is set to text/plain;charset=utf-8 and not JSON or msgpack. This is a datadog-agent bug: you should set the correct content-type for the payload you're responding with.

[[{"service":"tester","start":1553616115814258138,"trace_id":5953352386408593901,"name":"span","resource":"time","duration":800971,"span_id":8916538916538791533}]]
Left (UnsupportedContentType text/plain;charset=utf-8 (Response {responseStatusCode = Status {statusCode = 200, statusMessage = "OK"}, responseHeaders = fromList [("Date","Tue, 26 Mar 2019 16:01:55 GMT"),("Content-Length","40"),("Content-Type","text/plain; charset=utf-8")], responseHttpVersion = HTTP/1.1, responseBody = "{\"rate_by_service\":{\"service:,env:\":1}}\n"}))

We made the code to trigger this available publicly under a free software license. See https://github.com/symbiont-io/haskell-datadog

You probably don't have Haskell installed, but it's easy (caveat, on Windows it maybe isn't so easy, I've never tried)... just install https://www.haskell.org/ghcup/ and type

ghcup install-cabal
ghcup install 8.4.4
ghcup set 8.4.4

and you'll have a fully working Haskell build tool and compiler. You can then follow the instructions in https://github.com/symbiont-io/haskell-datadog/blob/master/manual-test/Main.hs

Also I hope this exercise shows why it would be very useful if the datadog agent returned error codes over HTTP, because we have absolutely no way to integration test our client when it returns 200s and silently drops traces (possibly logging them) :blush:

Hi Sam!

From your error message it seems like you're sending the wrong media type in the Content-Type header. Valid media types are:

  • application/json
  • text/json
  • application/msgpack

But maybe I've misread the response message.

Additionally, if you take a look at our span proto definition, you'll notice that the "metrics" map takes values double (float64), not strings, that's why you get the error.

This is a datadog-agent bug: you should set the correct content-type for the payload you're responding with.

You're right. I will fix this.

Also I hope this exercise shows why it would be very useful if the datadog agent returned error codes over HTTP, because we have absolutely no way to integration test our client when it returns 200s and silently drops traces (possibly logging them)

I completely understand your frustration and share the sentiment. Unfortunately the performance benefit we get from this behavior is much more important to us now. You've already figured out yourself what the criteria for valid traces are so you should be good to go.

I'd recommend experimenting locally while you develop the tracer. I'm also here and happy to answer all your questions (that I have an answer to).

As a side note, I developed an integration testing framework which (as part of other things) runs a trace-agent process, sends payloads to it and inspects the logs. So it's definitely doable, even if not as easy as you'd like. It is located here: https://godoc.org/github.com/DataDog/datadog-agent/pkg/trace/test. This is Go however.

I am using content-type application/msgpack and traces sometimes show up in datadog UI, and sometimes don't (with a log saying the data is either empty or dropped, with no reason given). You can run the code yourself by following the above instructions.

Proof: https://github.com/symbiont-io/haskell-datadog/blob/master/library/Servant/MsgPack.hs#L22

I am using content-type application/msgpack and traces sometimes show up in datadog UI, and sometimes don't (with a log saying the data is either empty or dropped, with no reason given).

Please paste the log entries which say that data is empty or dropped for no reason.

In newer versions of the agent (6.10) there should be specific reasons why traces are dropped.

Please paste the log entries which say that data is empty or dropped for no reason.

I already did:

I sent the exact same payload in both cases.

I'm using datadog/agent:6.10.3-logs-api-key-rc.1

The answer _should_ be there somewhere in the logs, at the DEBUG level entry. In your original comment your payload is missing a "service", which you've pointed out yourself in your list.

If you respect all those rules, all should be well. If it still fails, post the payload and the log outputs. Work locally until you figure it out.

The empty service was in #3172... the payload I'm using here is in the first comment. It sometimes works and sometimes doesn't :man_shrugging:

Is this perhaps related to sampling rates and that sort of thing?

I think you still have a codepath where dropped traces are not logged. I've only seen the logs around the normalisation but there are other places where traces and spans can be dropped.

Is this perhaps related to sampling rates and that sort of thing?

It could, but if you set the sampling priority to 2, it should be kept every time. I'm not entirely certain about our backend but there might be rare scenarios where it choses to drop it anyway. I think it's highly unlikely that this is happening to you. The trace should always show up in "traces received" if the payload was accepted.

There are only two scenarios where traces are dropped:

  • During normalization here.
  • If the agent can't handle it (rate limiting) here.

Both of these will log an error, and if you have DEBUG logging level enabled, the former should also give you more detail.

@gbbr I've re-read the agent code. I agree that these are the only two places where the trace can be dropped in this flow and that a reason is given. However it looks like the log I'm seeing is from the Stats... so the dropping must be happening after handleTraces.

I'm now using 2 in the sampler field, so I'll gather some more experiments before we can say anything more.

It would be good to have a fix for the content-type problem. Please let me know when you have a dev build that I can use.

Problem is fixed in #3212. Next week we should be releasing 6.11.0. Until then, you can try installing the dev. version (https://github.com/DataDog/datadog-agent/blob/master/docs/trace-agent/README.md#development)

For my own dev I was thinking more like when it lands in https://hub.docker.com/r/datadog/agent-dev/tags ... we'll upgrade our network to 6.11.0 when it lands.

Ok, I have more information. This might just be a logging bug in datadog-agent... I started fresh today running the same payload as above. I only ran my tester ONCE, so I could be sure of what I was seeing.

I get exactly two logs from api.go during startup and in response to that payload:

[ TRACE ] 2019-03-28 11:12:29 INFO (api.go:144) - listening for traces at http://localhost:8126
[ TRACE ] 2019-03-28 11:12:39 INFO (api.go:341) - no data received

and nothing else. In particular, I do not see the stats message that I often see. And my trace showed up in the datadog UI. So it was processed and not dropped.

I then send a second trace, same payload (different ids and time), which also shows up in the UI. And I see a log in the agent:

[ TRACE ] 2019-03-28 11:13:49 INFO (api.go:341) - [] -> traces received: 1, traces dropped: 0, traces filtered: 0, traces amount: 138 bytes, services received: 0, services amount: 0 bytes, events extracted: 0, events sampled: 0

I would expect to see traces received: 2 at this point.

It looks like you have an off-by-one somewhere, and the first trace is not logged.

Unfortunately the performance benefit we get from this behavior is much more important to us

It is my advice that you should use a UDP protocol if performance is what you care about. I would probably use it. If you use TCP/HTTP/REST then please adhere to the standard and provide error codes when traces are rejected or sampled, allowing for testability. Writing this integration has been much more difficult than it should have been because of this.

this is superceded by #3232

Was this page helpful?
0 / 5 - 0 ratings