When failing to send tracing payloads to Datadog, tracing-agent only logs useful information at DEBUG level.
I had a trailing newline character in my API key by mistake, which prevented tracing-agent from sending anything to Datadog (filed #6961 to have the API key sanitized), but the logs said nothing until I read the code and realized I should change the log level. This is where the call reporting a retry is made, and this is where it's processed.
It would be great if the agent logged at least sampled errors at warning level.
Having retries silently happening indefinitely is not what we want.
Some potential solutions:
I don't see any reason to fail a retry. That's what the queue size is for. We should try as long as possible to try and make the best effort not to lose data. Instead, I'd suggest we log the error on every Nth retry (let's say 3). WDYT?
I don't see any reason to fail a retry. That's what the queue size is for. We should try as long as possible to try and make the best effort not to lose data.
+1 on making the best effort not to lose data, but I disagree with the point about retrying infinitely:
That being said, it may be a good idea to split out rethinking retries to a separate issue and only address the logging here.
I believe Datadog Agent logs all request failures, so it's probably ok to do the same in trace-agent. Although I have nothing against sampling of those logs.
Even though that's a more complex implementation I agree with you. Let's treat that as a separate issue though.
Going back to your original issue, if I understood correctly, your payload failed to send because the API key was incorrect. On this occasion the intake returned a 5xx (server error) which the agent considered to be retriable. That's a mistake on our intake and the fix should be added there. This error should be changed to a 4xx (client error) to make sure the client/agent doesn't keep retrying something that will never succeed anyway.
I'm glad we're on the same page here.
The intake is probably fine - in my particular case the requests never made it to Datadog as newline characters are illegal in HTTP headers and Go's net/http package validates that. Here's the debug-level log message I got:
2020-12-09 07:29:06 UTC | TRACE | DEBUG | (pkg/trace/writer/trace.go:254 in recordEvent) | Retrying to flush trace payload; error: Post https://trace.agent.datadoghq.com/api/v0.2/traces: net/http: invalid header field value "***************************e1961\n" for key Dd-Api-Key
~Ok, but that still looks like a mistake on the client side, so it warrants a 4xx? Am I missing something here?~ Nevermind. Let me take a closer look.
Ok, so sorry about that. I misunderstood the problem because of my arriving late at the party.
From what I gather looking around at related issues, the API key "sanitization" PR and issue should resolve this one too. Generally, if everything is to work correctly, it should not be needed to log retries (if the payload is indeed retriable and not an error), only failures to send and drops. Would you agree? Would you be happy if we solved this issue by simply sanitizing the API key and making sure it's a valid HTTP header?
Basically PR https://github.com/DataDog/datadog-agent/pull/7242
I'm still convinced we should log errors when communicating with Datadog API regardless of whether the requests are retries or not the same way Datadog Agent does it.
As a maintainer of trace-agent instances in a cluster, I want to know what causes retries. Retries can happen due to multitude of reasons, bad API key configuration being just one of them: network problems, DNS issues, Datadog backend being down, Datadog backend's rate/concurrency limiters kicking in - all of those cases should be of interest to anyone running trace-agent.
Thanks for explaining. That does make sense. I'm just concerned that logging them might be spammy to some. But I do agree with you. Let's simply change from Debugf to Warnf in that case. That's what you're asking, right? I'm ok with that.
That is a valid concern!
The backoff logic in pkg/trace/writer/sender.go should prevent writing the log messages in a tight loop in case all payloads are being retried due to issues in Datadog backend.
It seems we can still run into problems, if only a few payload sends are being retried while others are succeeding: a message will be logged every few hundred milliseconds, which isn't great. Pairing this behavior with infinite retries leads to no good :)
One thing we could do is keep track of the number of retries per payload to see if logging makes sense (e.g. we reduce logging exponentially with the number of retries). This way we'd avoid excessive logging and maybe additionally log the retry attempt numbers, which would give another piece of information to those troubleshooting.
One thing we could do is keep track of the number of retries per payload to see if logging makes sense (e.g. we reduce logging exponentially with the number of retries). This way we'd avoid excessive logging and maybe additionally log the retry attempt numbers, which would give another piece of information to those troubleshooting.
Yeah I think that's a good approach. Let me sync with the PR author and try and make that happen. Perhaps logging on the 1st retry, and then at other intermediate ones again.
I've suggested that we log on retries number 1, 4, 8, 16, 32, ... Basically the powers of 2 (except 2 because we log on 1 馃檪). I think that should be reasonable.