Refactor our logging to use a structured logger with log levels. Structured logging gives us a consistent way to log messages and a way to filter them for support. It also lets us format messages either as the traditional log file we use (and like) today, or as JSON for consumption by automated log tools like Splunk.
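To make the difference concrete, here's the same hypothetical event as a traditional log line and as a structured JSON record (the module and field names are illustrative, not a proposal):

```
[httpd] 2016/05/02 03:44:09 POST /write 204 12ms
{"ts":"2016-05-02T03:44:09Z","level":"info","module":"httpd","method":"POST","path":"/write","status":204,"took_ms":12}
```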
Proposal: Refactor logging to a structured logger and log levels
Current behavior: We use the Go standard log package
Desired behavior: Use a structured logger with log levels
Use case: Structured logs are easier to follow, and log levels let us filter out messages that only concern developers.
/cc @beckettsean @corylanou
More useful if we also implement #7037.
Yes, #7037 along with log levels would be a killer feature for gathering diagnostic information from customers. It would also allow us to recommend WARN level for production systems and only do INFO or DEBUG when needed.
We should also add the ability to do backoff logging, so that seeing the same type of message over and over doesn't fill up the logs. Messages like "can't write to remote node" showing up 10,000 times in a second aren't useful, regardless of log level.
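A minimal sketch of what backoff logging could look like, using only the standard library (all names are hypothetical; a real version would want per-key expiry and would hook into the structured logger rather than the log package):

```go
package logutil

import (
	"log"
	"sync"
	"time"
)

// suppressingLogger prints a given kind of message at most once per
// interval; repeats within the window are counted instead of printed.
type suppressingLogger struct {
	mu       sync.Mutex
	interval time.Duration
	seen     map[string]entry
}

type entry struct {
	last       time.Time
	suppressed int
}

func NewSuppressingLogger(interval time.Duration) *suppressingLogger {
	return &suppressingLogger{interval: interval, seen: make(map[string]entry)}
}

// Log prints msg unless a message with the same key was printed within
// the interval; when the window reopens, it reports how many were dropped.
func (l *suppressingLogger) Log(key, msg string) {
	l.mu.Lock()
	defer l.mu.Unlock()

	e := l.seen[key]
	if time.Since(e.last) < l.interval {
		// Same kind of message inside the window: count it, don't print it.
		e.suppressed++
		l.seen[key] = e
		return
	}
	if e.suppressed > 0 {
		log.Printf("%s (repeated %d more times)", msg, e.suppressed)
	} else {
		log.Print(msg)
	}
	l.seen[key] = entry{last: time.Now()}
}
```

With something like this, an error that fires thousands of times a second prints once per interval plus a repeat count.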
@corylanou my favorite example is from a system with a Graphite endpoint that was getting spammed by bad connections. This error makes up over 99% of the log:
[graphite] 2016/05/02 03:44:09 unable to parse line: : received " " which doesn't have required fields
$ wc -l influxd.log
33376296 influxd.log
$ grep "unable to parse" influxd.log | wc -l
33232791
So in a 33-million-line log file (3.3GB uncompressed), only 143,505 lines are anything other than the Graphite error. That works out to about 385 of these error lines per second, assuming the file covers roughly a day.
+1 & allow configurable log handlers... as all levels ends into a single writer (/dev/stderr), to automatically catch some unexpected patterns and raise attention to humans to investigate deeper requires parsing of those logs.
Btw, I know structured logging can be expensive, but https://github.com/uber-go/zap/ is cheap.
@bclermont zap looks really good. I'll be looking into that as a potential solution for this.
A few notes about what I want out of structured logging before we settle on a step-by-step plan.
When it comes to the output and the features we want, I think it should be possible to set the log level differently for different parts of the system. I don't think that's needed in the first version, but we should also be able to dynamically change the level of these different loggers.
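For illustration, a sketch of per-subsystem loggers with runtime-adjustable levels, independent of any particular library (names and structure are hypothetical):

```go
package logutil

import (
	"log"
	"sync/atomic"
)

// Level constants, lowest to highest severity.
const (
	DEBUG int32 = iota
	INFO
	WARN
	ERROR
)

// Logger is a per-subsystem logger whose level can be changed at
// runtime (e.g. from an admin endpoint) without restarting the process.
type Logger struct {
	name  string
	level int32 // read and written atomically
}

func New(name string, level int32) *Logger { return &Logger{name: name, level: level} }

// SetLevel changes this subsystem's level while the process is running.
func (l *Logger) SetLevel(level int32) { atomic.StoreInt32(&l.level, level) }

func (l *Logger) logf(level int32, tag, format string, args ...interface{}) {
	if level < atomic.LoadInt32(&l.level) {
		return // below this subsystem's current level: drop the message
	}
	log.Printf("[%s] %s: "+format, append([]interface{}{l.name, tag}, args...)...)
}

func (l *Logger) Debugf(format string, args ...interface{}) { l.logf(DEBUG, "DEBUG", format, args...) }
func (l *Logger) Infof(format string, args ...interface{})  { l.logf(INFO, "INFO", format, args...) }
func (l *Logger) Warnf(format string, args ...interface{})  { l.logf(WARN, "WARN", format, args...) }
```

An admin endpoint could then call something like httpd.SetLevel(DEBUG) on a live process while the rest of the system stays at WARN.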
I would also like to see support for different output formats. I would personally like the default to be similar to the current output, with colors when a terminal is attached, but I know others want logfmt to be the default. Making this configurable seems easier than deciding what everyone might want. JSON is definitely something I think should be in there: even if I don't personally like JSON logging, common devops tools like Splunk can ingest JSON logs, and it's a decent enough feature as long as it doesn't take too much time to implement.
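For reference, the same hypothetical event rendered in the three formats under discussion:

```
console: 2016-05-02T03:44:09Z INFO [httpd] write ok {db=mydb, points=5000}
logfmt:  ts=2016-05-02T03:44:09Z lvl=info module=httpd msg="write ok" db=mydb points=5000
JSON:    {"ts":"2016-05-02T03:44:09Z","lvl":"info","module":"httpd","msg":"write ok","db":"mydb","points":5000}
```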
Thoughts on uber-go/zap. Overall I really like the idea, and I'm optimistic about the direction it takes as the API stabilizes. The sad part is that I really hope they rethink the current API, because the publicly exposed API is missing a few key things that make me hesitant to use it at the moment.
First, the things I like. I like the use of interfaces, and I like the options. I think the model for dealing with fields is superior because of fewer allocations, and I like the nesting. I also like zap.CheckedMessage and how the various levels are handled. I like that everything is typed and that interface{} is not used at all (a few popular structured logging libraries like log15 use interface{}, which I don't like). I much prefer log.Info("message", zap.String(key, value)) over log.Info("message", key, value). The latter is too prone to mistakes, and those mistakes won't get caught in compilation or by go vet, since they don't fall under the printf rules. I also like it better than logrus: even though logrus uses a similar system, it constructs a map to pass the fields and then has to sort them, rather than passing the values as a slice that's already ordered. apex/log uses something similar to logrus and I like it better, though it's roughly on par with how zap does it.
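Concretely, the two call styles being contrasted (the first is zap's typed style, the second the loosely-typed log15 style; the identifiers are illustrative and the two lines come from different APIs):

```go
// Typed fields: a wrong type or a stray argument is a compile error,
// and the fields are passed down as an already-ordered slice.
log.Info("opened shard", zap.String("path", path), zap.Int("id", id))

// Loosely-typed key/value pairs: a missing value or a swapped key
// compiles fine and only shows up later as a mangled log line.
log.Info("opened shard", "path", path, "id", id)
```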
The negative I've run into is a mostly inconsistent API when you try to do something more complicated. It's trivial to create a single logger, build child loggers off of it, and have them all write to the same place. But if you need different levels for those loggers, you're mostly out of luck.
If you need multiple loggers for some reason (such as different outputs or different levels), you mostly can't share the same output. To synchronize the output, you need to use zap.Output() to wrap it in a writer that's threadsafe. But if you use zap.Output() to generate an Option, it's constructed in a way where the threadsafe wrapper isn't shared between multiple loggers. So you either have to log to multiple different outputs or supply your own threadsafe writer.
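Supplying your own threadsafe writer is at least simple; a sketch with only the standard library:

```go
package logutil

import (
	"io"
	"sync"
)

// lockedWriter serializes writes from multiple loggers onto one io.Writer,
// so their output lines can't interleave mid-write.
type lockedWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func NewLockedWriter(w io.Writer) io.Writer { return &lockedWriter{w: w} }

func (l *lockedWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}
```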
Now, the log output is stored in a zap.Meta, and you can construct one with zap.MakeMeta, but that function appears to be useless: there's no way to construct your own zap.Meta and pass it to a zap.Logger. You have to construct the logger from a zap.Encoder and zap.Option values, which leads to the problem above.
You also can't clone the logger in any way or pass new options to child loggers, so you're stuck with whatever options you pass to the root logger.
Maybe we can use it anyway, but changing log levels or using other outputs doesn't seem possible unless they change the API a bit. Likely, allowing construction of a zap.Logger from a custom zap.Meta is all that's needed. I can ask on their issue tracker whether they're planning something like that. If I could change just that one thing, I think the library would work perfectly.