Datadog-agent: proposal: default log level too verbose

Created on 6 Nov 2018 · 13Comments · Source: DataDog/datadog-agent

I just started to forward all my journald logs to Datadog Logs, and I saw that on an (mostly) idle server trace-agent is generating ~30k lines per day.

Most of those seems to be this two:

INFO (stats_writer.go:265) - flushed stat payload to the API, time:324.168702ms, size:2073 b…
INFO (trace_writer.go:97) - flushed trace payload to the API,…

I think that on normal operations the agent should be far less verbose.

This is made worse by the fact that the agent is not setting the log level correctly (see issue #516).

teaagent-apm

Source

renchap

All 13 comments

This has been generally found to be valuable information. Nevertheless, you should be able to run the agent with a different logging level, no? Which takes us to: should we close this as a duplicate of #516? Or is your proposal here to change the default logging level, which I think we can not do at this point.

gbbr on 7 Nov 2018

I am not sure what the correct solution would be here. Changing the log_level works but it also changes it for datadog-agent which is not outputting this many logs in this level.

Another solution would be to move those lines to another log_level (like debug), but I do not know how useful they are for users.

My last idea would be to reduce the frequency for those, either from the start of after the first n lines, like datadog-agent does for checks (ie Done running check memory, next runs will be logged every 20 runs). They may be replaced with 1 line per minute, containing the stats for the last minute?

Also, they are displayed even if there are no traces on the server. I have one server with no trafic but APM enabled, and it is outputting the same amount of those logs as one with production trafic.

renchap on 7 Nov 2018

I think this is a legit request. And you're right, those message should probably go under a different level or be reported less frequently.

gbbr on 7 Nov 2018

Also, they are displayed even if there are no traces on the server. I have one server with no trafic but APM enabled, and it is outputting the same amount of those logs as one with production trafic.

@renchap I'd like to work on improving this aspect of the agent for the next release. It's not very pleasant to have your logs spammed and I agree. It would help me if you could provide a slightly bigger chunk of the logs. You say your server is mostly idle, which makes it even more unusual.

gbbr on 12 Nov 2018

Sure! Here are 90s of logs on a staging server, running a Rails app + Sidekiq, with 0 web requests done during this time:

Nov 12 10:53:14 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:14 INFO (trace_writer.go:97) - flushed trace payload to the API, time:546.235006ms, size:4821 bytes
Nov 12 10:53:19 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:19 INFO (trace_writer.go:97) - flushed trace payload to the API, time:328.391875ms, size:4902 bytes
Nov 12 10:53:24 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:24 INFO (stats_writer.go:265) - flushed stat payload to the API, time:322.23409ms, size:2063 bytes
Nov 12 10:53:24 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:24 INFO (trace_writer.go:97) - flushed trace payload to the API, time:330.552979ms, size:3443 bytes
Nov 12 10:53:29 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:29 INFO (trace_writer.go:97) - flushed trace payload to the API, time:324.64456ms, size:4701 bytes
Nov 12 10:53:34 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:34 INFO (trace_writer.go:97) - flushed trace payload to the API, time:328.680651ms, size:3936 bytes
Nov 12 10:53:34 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:34 INFO (stats_writer.go:265) - flushed stat payload to the API, time:324.699612ms, size:2072 bytes
Nov 12 10:53:39 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:39 INFO (trace_writer.go:97) - flushed trace payload to the API, time:326.441837ms, size:4581 bytes
Nov 12 10:53:44 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:44 INFO (api.go:324) - [lang:ruby lang_version:2.5.1 interpreter:ruby-x86_64-linux tracer_version:0.17.0] -> traces received: 662, traces dropped: 0, traces filtered: 0, traces amount: 307811 bytes, services received: 59, services amount: 2183 bytes
Nov 12 10:53:44 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:44 INFO (trace_writer.go:97) - flushed trace payload to the API, time:335.935148ms, size:4910 bytes
Nov 12 10:53:44 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:44 INFO (stats_writer.go:265) - flushed stat payload to the API, time:327.281065ms, size:2055 bytes
Nov 12 10:53:49 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:49 INFO (trace_writer.go:97) - flushed trace payload to the API, time:322.304145ms, size:4652 bytes
Nov 12 10:53:54 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:54 INFO (trace_writer.go:97) - flushed trace payload to the API, time:330.329691ms, size:4217 bytes
Nov 12 10:53:54 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:54 INFO (stats_writer.go:265) - flushed stat payload to the API, time:323.587486ms, size:2026 bytes
Nov 12 10:53:59 tg-staging2 trace-agent[20791]: 2018-11-12 10:53:59 INFO (trace_writer.go:97) - flushed trace payload to the API, time:366.261862ms, size:4252 bytes
Nov 12 10:54:04 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:04 INFO (stats_writer.go:265) - flushed stat payload to the API, time:320.912179ms, size:2071 bytes
Nov 12 10:54:04 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:04 INFO (trace_writer.go:97) - flushed trace payload to the API, time:369.912593ms, size:4547 bytes
Nov 12 10:54:09 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:09 INFO (trace_writer.go:97) - flushed trace payload to the API, time:328.758929ms, size:4667 bytes
Nov 12 10:54:14 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:14 INFO (service_mapper.go:59) - total number of tracked services: 9
Nov 12 10:54:14 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:14 INFO (trace_writer.go:97) - flushed trace payload to the API, time:419.185318ms, size:4398 bytes
Nov 12 10:54:14 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:14 INFO (stats_writer.go:265) - flushed stat payload to the API, time:412.926405ms, size:2062 bytes
Nov 12 10:54:19 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:19 INFO (trace_writer.go:97) - flushed trace payload to the API, time:433.940838ms, size:4774 bytes
Nov 12 10:54:24 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:24 INFO (stats_writer.go:265) - flushed stat payload to the API, time:323.758772ms, size:2054 bytes
Nov 12 10:54:24 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:24 INFO (trace_writer.go:97) - flushed trace payload to the API, time:336.65624ms, size:4123 bytes
Nov 12 10:54:29 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:29 INFO (trace_writer.go:97) - flushed trace payload to the API, time:324.880409ms, size:4761 bytes
Nov 12 10:54:34 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:34 INFO (trace_writer.go:97) - flushed trace payload to the API, time:327.443028ms, size:4445 bytes
Nov 12 10:54:34 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:34 INFO (stats_writer.go:265) - flushed stat payload to the API, time:326.429154ms, size:2043 bytes
Nov 12 10:54:39 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:39 INFO (trace_writer.go:97) - flushed trace payload to the API, time:331.165328ms, size:4563 bytes
Nov 12 10:54:44 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:44 INFO (trace_writer.go:97) - flushed trace payload to the API, time:324.349926ms, size:4391 bytes
Nov 12 10:54:44 tg-staging2 trace-agent[20791]: 2018-11-12 10:54:44 INFO (stats_writer.go:265) - flushed stat payload to the API, time:330.96462ms, size:2052 bytes

renchap on 12 Nov 2018

This is odd, because according to the logs, there is a payload which was sent to our API via the trace agent, from your Ruby application. I think there might be a another issue here because those logs don't make sense if you say that there were 0 requests. I assume you are using our Ruby tracer.

My only other idea is to potentially do a tcpdump on port 8126 and see if the trace agent is receiving anything.

@pawelchcki do you have any ideas? Is there any chance the Ruby tracer is sending payloads here?

gbbr on 12 Nov 2018

Yes I am using ddtrace-rb.
I just got this graph from Datadog Logs, showing the trace-agent logs for 3 servers:

light blue: production server with production trafic (not huge traffic, but some real traffic)
darker blue: the staging server I showed above, running the same Rails/Sidekiq app than production above, with pretty much 0 traffic
purple: totally unrelated server with datadog-agent installed, trace-agent enabled but no local processes with tracing enabled

You can see that the 2 servers running Ruby tracing generate about the same number of logs, (most of it being the flushed trace/stats payload to the API lines), independant of the web traffic, where the other server does not output these lines.

I never looked at this third server, so it seems you are correct and payloads are sent by the Ruby gem even if there is no traffic, and this generated the same amount of logs regardless of the traffic.

renchap on 12 Nov 2018

Thanks a lot for providing all the details. It is really helpful. I think in that case the next step to take here is to potentially investigate why the Ruby tracer seems to be sending payloads when there seems to be no traffic. For that, however, I think the best approach to take is either to open an issue in dd-trace-rb and link to this one for details, or to go via our support. I'm more than happy to provide all the help I can on the agent side of things. If you reach out to support, feel free to mention me and I'll keep an eye out on that ticket as well.

gbbr on 12 Nov 2018

Thanks!

To come back to my initial issue, do you still think we could reduce the number of those flushed trace/stats payload to the API even if the trace agent received a few payloads, as their number seems to be constant as long as there is any activity (not depending on the amount of payloads received)?

I still think they could be reduced somehow in normal operations.

renchap on 12 Nov 2018

👍1

Yes, we definitely can and will. I have an item for this in my backlog.

gbbr on 12 Nov 2018

👍1

HI @gbbr

Just wondering if you have any idea when this item will be implemented?
Reducing the log events would also be useful for our applications

Thanks