Amazon-ecs-agent: Refactor Agent logging

Created on 19 Nov 2019  Ā·  4Comments  Ā·  Source: aws/amazon-ecs-agent

Summary

We will be refactoring the ECS agent's logging to make it more flexible and powerful.

This is a master logging ticket to cover issues #1738, #881, and #287

Design details are being finalized but will be posted here when ready for public consumption and comment.

kinlogging

Most helpful comment

Hello everyone, please review and comment on the following plan for our logs going forward.

Problem statement

This project aims to address the following problems:

  1. Logfiles should be able to rotate based on size, so that customers can configure the rotation to guarantee that the disk isn’t filled. (see https://github.com/aws/amazon-ecs-agent/issues/1738)
  2. Structured (JSON) logging should be supported, so that customers can machine-parse their logs (see https://github.com/aws/amazon-ecs-agent/issues/881)
  3. The ā€˜current’ logfile shouldn’t have any suffixes added. Suffixes should only be added when the file is rotated. This will allow customers easier integration with standard linux logging tools, such as syslog. (see https://github.com/aws/amazon-ecs-agent/issues/287)

Solution

overview

We plan to make the following overall changes:

  1. Rollover Behavior: change rollover behavior to remove the suffix on the ā€˜current’ logfile, and support size-based file rotation.
  2. Output format: change the default logging format to logfmt, and support a JSON output format.
  3. Custom context and loggers: add a ā€˜module’ context and add additional log context for a few selected structs.

Rollover behavior

Currently we use a date-based rollover behavior that rotates the logfile every hour, resulting in logfiles that look like /var/log/ecs/ecs-agent.log.2019-12-09-21

This behavior will remain the same for date-based rollovers, with one exception, that the ā€œcurrentā€ logfile will always be named /var/log/ecs/ecs-agent.log. only rotated files receive a suffix. The primary reasoning behind this is because many implementations of syslog do not support wildcard matches, which makes integrating agent logs with syslog impossible for many of our customers.

Size-based rollovers will use the standard linux convention of appending a number to the rotated files. ie, the ā€œcurrentā€ logfile will be named /var/log/ecs/ecs-agent.log, with rotated files getting a numerical suffix, ie,
/var/log/ecs/ecs-agent.log.1, /var/log/ecs/ecs-agent.log.2, etc.

Output format

The default output format will be a structured log line in the style of logfmt, we will also be adding the ā€˜module’ context which will simply be the filename from where the log msg came.
level=info time=2019-12-12T23:43:29Z msg="Initializing stats engine" module=engine.go

The reason for this format is that it is fairly easily to machine-parse in addition to being human-readable. It has become something of a defacto standard (also being used in uber/zap and logrus, for example), and it conforms to Splunk’s logging best practices (see https://dev.splunk.com/enterprise/docs/developapps/logging/loggingbestpractices/)

The json output format will output the same information, but in a one-line json format:
{"time": "2019-11-07T22:52:02Z", "level": "info", "msg": "Starting Amazon Elastic Container Service Agent", "module": "engine.go"}

Custom context and loggers

First off we can add a ā€˜module’ context that states the filename that the log was generated in.

Currently, we create a single logger for the entire agent project. In order to guarantee that a standard set of context gets added, we will add custom loggers with context added for a few select structs:

|struct |custom context |
|--- |--- |
|Task |taskARN |
|Container |containerName |
|ENI |ENIID |
|ManagedTask |taskARN |
|VolumeResource |volumeName |

As an example, one current logline in managedTask would go from this:
2019-12-13T00:16:51Z [INFO] Managed task [arn:aws:ecs:us-west-2:039193556298:task/auto-robc/33d755ba06e94b3594ba9c2da95a6f4a]: waiting for event for task

to this:
level=info time=2019-12-13T00:16:51Z msg="waiting for event for task" module=task_manager.go taskARN=arn:aws:ecs:us-west-2:039193556298:task/auto-robc/33d755ba06e94b3594ba9c2da95a6f4a

The benefit here is that agent developers do not need to add context on a case-by-case basis, which often leads to inconsistent or missing context when multiple developers are working on a project.

As an example of developer experience, we will be changing logging calls that look like this:
seelog.Infof("initialize volume: Task [%s]: non-autoprovisioned volume not found", task.Arn)

to this:
task.log.Infof("initialize volume: non-autoprovisioned volume not found")

configuration

The following configuration environment variables will need to be added to the agent:

|Config Var |Options |Description |Default value |
|--- |--- |--- |--- |
|ECS_LOG_ROLLOVER_TYPE |"date" or "size" |Controls whether the logfile will be rotated based on size or hour |date |
|ECS_LOG_OUTPUT_FORMAT |"logfmt" or "json" |Controls the log output format. json would have each logline being a structured json map containing timestamp, message, and log level. |logfmt |
|ECS_LOG_MAX_FILE_SIZE_MB |any float |If rollover behavior is set to "size", this controls the maximum size (in megabytes) of each logfile |10 |
|ECS_LOG_MAX_ROLL_COUNT |any integer |Controls the number of rotated logfiles to keep. Applies to both size and date rollover options. |24 |
| | | | |

Appendix

Agent startup logfile example (logfmt)

level=info time=2019-12-12T23:43:29Z msg="Loading configuration" module=agent.go
level=info time=2019-12-12T23:43:29Z msg="Image excluded from cleanup: amazon/amazon-ecs-agent:latest" module=parse.go
level=info time=2019-12-12T23:43:29Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=parse.go
level=info time=2019-12-12T23:43:29Z msg="Amazon ECS agent Version: 1.34.0, Commit: ca640387" module=agent.go
level=info time=2019-12-12T23:43:29Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
level=info time=2019-12-12T23:43:29Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
level=info time=2019-12-12T23:43:29Z msg="Loading state!" module=statemanager.go
level=info time=2019-12-12T23:43:29Z msg="Event stream ContainerChange start listening..." module=eventstream.go
level=info time=2019-12-12T23:43:29Z msg="Restored cluster 'auto-robc'" module=agent.go
level=info time=2019-12-12T23:43:29Z msg="Restored from checkpoint file. I am running as 'arn:aws:ecs:us-west-2:039193556298:container-instance/auto-robc/3330a8a91d15464ea30662d5840164cd' in cluster 'auto-robc'" module=agent.go

All 4 comments

Thanks for starting on this!

Can you clarify what this Issue is addressing exactly if ECS_AVAILABLE_LOGGING_DRIVERS allows json-file already?

See: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html

@nehalrp This issue is for addressing problems with the logfiles that the ecs agent writes to the /var/log/ecs/ directory.

The env var you are referencing is used for configuring the instance's docker daemon's available docker logging drivers, which I believe is more related to the logs generated by the tasks running on the instance.

Hello everyone, please review and comment on the following plan for our logs going forward.

Problem statement

This project aims to address the following problems:

  1. Logfiles should be able to rotate based on size, so that customers can configure the rotation to guarantee that the disk isn’t filled. (see https://github.com/aws/amazon-ecs-agent/issues/1738)
  2. Structured (JSON) logging should be supported, so that customers can machine-parse their logs (see https://github.com/aws/amazon-ecs-agent/issues/881)
  3. The ā€˜current’ logfile shouldn’t have any suffixes added. Suffixes should only be added when the file is rotated. This will allow customers easier integration with standard linux logging tools, such as syslog. (see https://github.com/aws/amazon-ecs-agent/issues/287)

Solution

overview

We plan to make the following overall changes:

  1. Rollover Behavior: change rollover behavior to remove the suffix on the ā€˜current’ logfile, and support size-based file rotation.
  2. Output format: change the default logging format to logfmt, and support a JSON output format.
  3. Custom context and loggers: add a ā€˜module’ context and add additional log context for a few selected structs.

Rollover behavior

Currently we use a date-based rollover behavior that rotates the logfile every hour, resulting in logfiles that look like /var/log/ecs/ecs-agent.log.2019-12-09-21

This behavior will remain the same for date-based rollovers, with one exception, that the ā€œcurrentā€ logfile will always be named /var/log/ecs/ecs-agent.log. only rotated files receive a suffix. The primary reasoning behind this is because many implementations of syslog do not support wildcard matches, which makes integrating agent logs with syslog impossible for many of our customers.

Size-based rollovers will use the standard linux convention of appending a number to the rotated files. ie, the ā€œcurrentā€ logfile will be named /var/log/ecs/ecs-agent.log, with rotated files getting a numerical suffix, ie,
/var/log/ecs/ecs-agent.log.1, /var/log/ecs/ecs-agent.log.2, etc.

Output format

The default output format will be a structured log line in the style of logfmt, we will also be adding the ā€˜module’ context which will simply be the filename from where the log msg came.
level=info time=2019-12-12T23:43:29Z msg="Initializing stats engine" module=engine.go

The reason for this format is that it is fairly easily to machine-parse in addition to being human-readable. It has become something of a defacto standard (also being used in uber/zap and logrus, for example), and it conforms to Splunk’s logging best practices (see https://dev.splunk.com/enterprise/docs/developapps/logging/loggingbestpractices/)

The json output format will output the same information, but in a one-line json format:
{"time": "2019-11-07T22:52:02Z", "level": "info", "msg": "Starting Amazon Elastic Container Service Agent", "module": "engine.go"}

Custom context and loggers

First off we can add a ā€˜module’ context that states the filename that the log was generated in.

Currently, we create a single logger for the entire agent project. In order to guarantee that a standard set of context gets added, we will add custom loggers with context added for a few select structs:

|struct |custom context |
|--- |--- |
|Task |taskARN |
|Container |containerName |
|ENI |ENIID |
|ManagedTask |taskARN |
|VolumeResource |volumeName |

As an example, one current logline in managedTask would go from this:
2019-12-13T00:16:51Z [INFO] Managed task [arn:aws:ecs:us-west-2:039193556298:task/auto-robc/33d755ba06e94b3594ba9c2da95a6f4a]: waiting for event for task

to this:
level=info time=2019-12-13T00:16:51Z msg="waiting for event for task" module=task_manager.go taskARN=arn:aws:ecs:us-west-2:039193556298:task/auto-robc/33d755ba06e94b3594ba9c2da95a6f4a

The benefit here is that agent developers do not need to add context on a case-by-case basis, which often leads to inconsistent or missing context when multiple developers are working on a project.

As an example of developer experience, we will be changing logging calls that look like this:
seelog.Infof("initialize volume: Task [%s]: non-autoprovisioned volume not found", task.Arn)

to this:
task.log.Infof("initialize volume: non-autoprovisioned volume not found")

configuration

The following configuration environment variables will need to be added to the agent:

|Config Var |Options |Description |Default value |
|--- |--- |--- |--- |
|ECS_LOG_ROLLOVER_TYPE |"date" or "size" |Controls whether the logfile will be rotated based on size or hour |date |
|ECS_LOG_OUTPUT_FORMAT |"logfmt" or "json" |Controls the log output format. json would have each logline being a structured json map containing timestamp, message, and log level. |logfmt |
|ECS_LOG_MAX_FILE_SIZE_MB |any float |If rollover behavior is set to "size", this controls the maximum size (in megabytes) of each logfile |10 |
|ECS_LOG_MAX_ROLL_COUNT |any integer |Controls the number of rotated logfiles to keep. Applies to both size and date rollover options. |24 |
| | | | |

Appendix

Agent startup logfile example (logfmt)

level=info time=2019-12-12T23:43:29Z msg="Loading configuration" module=agent.go
level=info time=2019-12-12T23:43:29Z msg="Image excluded from cleanup: amazon/amazon-ecs-agent:latest" module=parse.go
level=info time=2019-12-12T23:43:29Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=parse.go
level=info time=2019-12-12T23:43:29Z msg="Amazon ECS agent Version: 1.34.0, Commit: ca640387" module=agent.go
level=info time=2019-12-12T23:43:29Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
level=info time=2019-12-12T23:43:29Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
level=info time=2019-12-12T23:43:29Z msg="Loading state!" module=statemanager.go
level=info time=2019-12-12T23:43:29Z msg="Event stream ContainerChange start listening..." module=eventstream.go
level=info time=2019-12-12T23:43:29Z msg="Restored cluster 'auto-robc'" module=agent.go
level=info time=2019-12-12T23:43:29Z msg="Restored from checkpoint file. I am running as 'arn:aws:ecs:us-west-2:039193556298:container-instance/auto-robc/3330a8a91d15464ea30662d5840164cd' in cluster 'auto-robc'" module=agent.go
Was this page helpful?
0 / 5 - 0 ratings