Beats: Setting host.* in Beats that forward data

Created on 4 Oct 2019 · 14 Comments · Source: elastic/beats

There are several use cases in Beats where the data reported by a Beat did not originate on the host running that Beat. Some examples are syslog, Windows forwarded events, router NetFlow data, and CloudWatch logs. In these cases it would be appropriate to set the host.* fields to information about the originating machine.

From ECS:

ECS host.* fields should be populated with details about the host on which the event happened, or from which the measurement was taken.

Some issues related to this:

  • #13777
  • #13706
  • #13589
  • #10698

I think we need a way for inputs and modules to be able to "designate" that host.* should not be set by default. The output pipeline and also the add_host_metadata processor will need to honor this "designation".
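One way to picture such a "designation" is a per-event flag that the output pipeline checks before it adds host.name. This is only a sketch: the `skip_host` metadata key and the `publish` function are hypothetical names chosen for illustration, not part of libbeat.

```python
from copy import deepcopy

# Hypothetical metadata key; the issue does not name a concrete mechanism.
SKIP_HOST_KEY = "skip_host"


def publish(event, meta, beat_hostname):
    """Return a copy of the event as a simplified output pipeline would emit it."""
    out = deepcopy(event)
    if not meta.get(SKIP_HOST_KEY, False):
        # Default behaviour: stamp the event with the Beat's own host name.
        out.setdefault("host", {})["name"] = beat_hostname
    return out


# A locally read log line: the Beat's own host is the right value.
local = publish({"message": "disk full"}, {}, "collector-01")

# A forwarded syslog line: the input knows the originator and opts out.
forwarded = publish(
    {"message": "link down", "host": {"name": "router-7"}},
    {SKIP_HOST_KEY: True},
    "collector-01",
)
print(local["host"]["name"])      # collector-01
print(forwarded["host"]["name"])  # router-7
```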

Filebeat Functionbeat Journalbeat Metricbeat Services Winlogbeat discussion ecs libbeat

All 14 comments

@webmat Would it be appropriate to populate observer.* in any of the above use cases? If so, which host's data would go there?

Here are cases that should populate observer:

  • A Beat is actively monitoring a different host. E.g. Metricbeat collecting MySQL metrics from another host, like an AWS RDS instance.

    • In this case, the host running Metricbeat should go into observer.hostname, and likewise for the other observer fields.

    • In this case, the host being monitored should go into host.hostname, and likewise for the other host fields.

  • A Beat is collecting logs locally from another agent acting as an observer (e.g. Zeek monitoring a network tap, or a Beat installed on a network appliance).

    • In this case, since both the Beat and the software doing the monitoring are on the same host, observer.hostname and the other observer fields should be populated with that host's details.

    • If the data being collected contains information about the monitored hosts, then this goes to host.*

    • If the data being collected does not contain information about the monitored hosts (e.g. network flow stats), then host.* cannot be populated.

Here are cases that should not populate observer:

  • A Beat is receiving Syslog events and passing them along.
  • A Beat is receiving Windows Event Logs and passing them along.

In both of these cases, if the data stream contains the source host's details, they should go to host.*.
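The cases above can be summarized as two small event builders. This is a sketch only: the function names are made up for illustration, and just the hostname field of each namespace is shown.

```python
def remote_monitoring_event(beat_host, monitored_host, fields):
    """A Beat on beat_host actively monitors a service on monitored_host."""
    event = dict(fields)
    event["observer"] = {"hostname": beat_host}   # where the Beat runs
    event["host"] = {"hostname": monitored_host}  # what is being measured
    return event


def forwarded_event(message, source_host=None):
    """A Beat relays Syslog or Windows Event Log data; observer stays unset."""
    event = {"message": message}
    if source_host is not None:
        # host.* is set only when the data stream names its origin.
        event["host"] = {"hostname": source_host}
    return event


rds = remote_monitoring_event("metricbeat-01", "mysql-rds-1", {"event": {"module": "mysql"}})
relayed = forwarded_event("user login failed", source_host="dc-02")
anonymous = forwarded_event("flow stats")
```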

The case of monitoring containers may require a chat. I see a few cases; there may be more:

  • An agent can be installed on the host running Docker directly (not using Docker)

    • It can then monitor containers

    • It can monitor the host itself

  • An agent can be installed as a sidecar container, with the purpose of monitoring other containers local to the host
  • An agent can be installed in a container, with the purpose of monitoring the host

I'm not sure the current semantics as defined by ECS are clear or sufficient enough to fully capture this. Maybe I'm wrong. But I'd be more than happy to have a chat with folks who specialize in monitoring Docker, and hash out ideas here. Let me know if that would be useful.

I just installed Filebeat 7.4 and I am using the netflow module. I have the same problem: agent.hostname holds the hostname of the server running the netflow input instead of the sender's. I had to update some Kibana visualizations and replace agent.hostname with observer.ip.

It is not only the processors; libbeat also adds some fields directly. See: https://github.com/elastic/beats/blob/master/libbeat/publisher/processing/default.go#L78

I'm in favor of not automatically modifying any event once it hits libbeat. All modification should be opt-in via processors.

I think we need a way for inputs and modules to be able to "designate" that host.* should not be set by default. The output pipeline and also the add_host_metadata processor will need to honor this "designation".

I totally agree, we have this same problem with cloud metadata and cloud monitoring modules (AWS, Azure, etc). Something we have with add_cloud_metadata is that it won't override any info sent by the input/module itself:

https://github.com/elastic/beats/blob/1fb41ea3d464ba55317fc2201183f7ad80e6429a/libbeat/processors/add_cloud_metadata/add_cloud_metadata.go#L112-L117

In this case, cloud metadata for the agent is not sent, which may not be ideal (it could make sense to have it under observer.cloud?)

Something like this could make sense for add_host_metadata, where it could decide to put the metadata somewhere else (as you said, maybe observer)

The case of monitoring containers may require a chat. I see a few cases; there may be more:

  • An agent can be installed on the host running Docker directly (not using Docker)

    • It can then monitor containers
    • It can monitor the host itself
  • An agent can be installed as a sidecar container, with the purpose of monitoring other containers local to the host
  • An agent can be installed in a container, with the purpose of monitoring the host

I'm not sure the current semantics as defined by ECS are clear or sufficient enough to fully capture this. Maybe I'm wrong. But I'd be more than happy to have a chat with folks who specialize in monitoring Docker, and hash out ideas here. Let me know if that would be useful.

Happy to participate in this conversation @webmat!

To me it feels like the overall problem is that we have a limited set of namespaces, yet there is potentially a trail of subsystems an event might have passed through. Ultimately one might want an array of system descriptors for everything the event has passed through.

For now I want to solve the issue at hand, as this has come up a few times already. The host and agent fields are always overwritten. Currently host and other fields are enforced, potentially overwriting fields no matter where the values come from. This is not really an ECS problem in itself, but a general Beats one, as we also mess with users already using ECS for their own data.

Overall ECS discussions are maybe better handled in the github.com/elastic/ecs repository.

For example in filebeat we introduced a kafka input, to allow architectures like Beats->Kafka->Filebeat->Elasticsearch. The problem becomes even more apparent in this situation. In the simplest case Beats would be the host, and the second Filebeat should be the collector. We also pass some fields via @metadata from the inputs like pipeline name, document id, or index name. Once the event reaches the 'collector' filebeat, it should be treated by default as if Beats->Elasticsearch has been configured directly. And this is currently not the case.

We also do not want to introduce "heavy" breaking changes in 7.x. So for 7.x I'm planning these changes (independent of ECS):

  • libbeat changes (these are currently enforced and can't be disabled by the user):

    • Do not overwrite ecs.version, if it's already present

    • Do not change the host field or any of its subfields if already present

    • Do not update agent fields, if agent is already present

    • Do not update observer fields, if observer is already present

  • Update processors:

    • Add overwrite setting to selected list of processors that affect cloud, host, observer fields

    • Default value is false

    • If false, the namespaces are protected. e.g. if host is already available in the event, the processor will not add any host.X fields (as these might represent the wrong host)

    • Add target or namespace setting, so users can override the field names (e.g. set target to 'collector' when add_host_metadata is used)

    • processors to be adapted: add_cloud_metadata, add_host_metadata, add_observer_metadata

    • add_locale: do not overwrite event.timezone if present, but allow user to configure alternative target field

    • combine add_host_metadata and add_observer_metadata into a common processor (it's mostly a copy and paste right now)
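The proposed overwrite and target settings could behave roughly as follows. This is a minimal sketch of the plan, not the actual processor: the Python function only borrows the processor's name, and replacing the whole namespace when overwrite is true is an assumption, since the plan does not spell that detail out.

```python
from copy import deepcopy


def add_host_metadata(event, host_info, overwrite=False, target="host"):
    """Return a copy of event with host_info stored under `target`.

    With overwrite=False an existing `target` namespace is left untouched.
    """
    out = deepcopy(event)
    if target in out and not overwrite:
        return out  # namespace is protected
    out[target] = deepcopy(host_info)
    return out


beat_host = {"name": "collector-01", "os": {"family": "debian"}}
forwarded = {"message": "link down", "host": {"name": "router-7"}}

# Default: the originating host's details survive.
protected = add_host_metadata(forwarded, beat_host)

# With a different target, the Beat's own host is recorded separately.
retargeted = add_host_metadata(forwarded, beat_host, target="collector")
```

With target set to 'collector', both the originating host and the forwarding host end up in the event without colliding.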

For 8.x I would like to remove setting host, agent or observer from within libbeat. Libbeat should not enforce fields, but allow solutions/users to opt-in. All these fields are already available via processors, and I'd prefer to provide default configs with these being enabled. Moving more functionality to processors and removing default behavior will also simplify the setup in libbeat itself.

I will create issues for individual tasks, if we agree on the plan.

This plan makes a lot of sense. Thanks for putting this together, @urso!

FWIW this also concerns APM, we are setting observer information.

@urso, regarding an overwrite option in the processors, you wrote:

If false, the namespaces are protected. e.g. if host is already available in the event, the processor will not add any host.X fields.

This sounds good, but I don't think it can work as long as libbeat continues to set host.name, because then the add_host_metadata fields would never get added, since host would always exist. We could move setting the host.name field into the add_host_metadata processor. I think that would address this problem, but would probably cause some level of breaking change (haven't thought through all the consequences yet). WDYT?

For reference this is the code that adds host.name followed by where it runs the global add_host_metadata:

https://github.com/elastic/beats/blob/61fe9fc533bc86af0cb0486a3740bf68d96556d5/libbeat/publisher/processing/default.go#L311-L317
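The ordering problem can be reproduced in a few lines. This is a simplified model, not the libbeat code linked above: both functions are hypothetical stand-ins, and the processor is assumed to protect the host namespace as proposed.

```python
def builtin_host_name(event, beat_hostname):
    """Stand-in for libbeat setting host.name before processors run."""
    event.setdefault("host", {})["name"] = beat_hostname
    return event


def add_host_metadata(event, host_info, overwrite=False):
    """Stand-in for the processor with the proposed protection."""
    if "host" in event and not overwrite:
        return event  # host already exists, so nothing is added
    event["host"] = dict(host_info)
    return event


event = {"message": "hello"}
event = builtin_host_name(event, "beat-01")
event = add_host_metadata(event, {"name": "beat-01", "os": {"family": "debian"}})

# The builtin created the host namespace first, so the processor
# skipped its work and host.os is missing.
print(event["host"])  # {'name': 'beat-01'}
```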

Pinging @elastic/integrations-services (Team:Services)

Good point. Maybe it would be better to apply the 'builtin' fields after the local and global processors have been run. When applying builtins, the builtins must not overwrite existing fields, but can add missing fields to namespaces. I think this is a change we can make in 7.x. Although it is a change in behavior, we might consider the current behavior a bug, and the change a fix :)
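"Must not overwrite existing fields, but can add missing fields to namespaces" amounts to a recursive merge in which the event's own values win. A minimal sketch, with a hypothetical helper name and made-up field values:

```python
from copy import deepcopy


def add_missing(event, builtins):
    """Recursively add builtin fields that the event does not already have."""
    for key, value in builtins.items():
        if key not in event:
            event[key] = deepcopy(value)
        elif isinstance(event[key], dict) and isinstance(value, dict):
            add_missing(event[key], value)
        # Otherwise the event's own value wins and is left as-is.
    return event


# State of the event after local and global processors have run.
event = {"ecs": {"version": "1.0.0"}, "agent": {"type": "custom"}}
builtins = {
    "ecs": {"version": "1.1.0"},
    "agent": {"type": "filebeat", "version": "7.6.0"},
    "host": {"name": "beat-01"},
}
result = add_missing(event, builtins)
```

Here ecs.version and agent.type keep the values already in the event, while agent.version and host.name are filled in from the builtins.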

I'm afraid that if we move the builtins after the globals that we will break existing pipelines. There are users with drop_fields in their pipelines to remove builtins.
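The concern is easy to see when the two orderings are placed side by side. In this sketch `drop_fields` only handles top-level keys and `add_builtins` only sets host.name; both are simplified stand-ins rather than the real implementations.

```python
def drop_fields(event, fields):
    """Simplified drop_fields: removes top-level keys only."""
    for name in fields:
        event.pop(name, None)
    return event


def add_builtins(event, beat_hostname):
    """Adds host.name only if it is missing."""
    event.setdefault("host", {}).setdefault("name", beat_hostname)
    return event


# Current order: builtins first, then the user's drop_fields removes them.
current = drop_fields(add_builtins({"message": "m"}, "beat-01"), ["host"])

# Proposed order: drop_fields runs first, then the builtins add host back.
proposed = add_builtins(drop_fields({"message": "m"}, ["host"]), "beat-01")

print("host" in current)   # False
print("host" in proposed)  # True
```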

My proposal is that we implement your plan above in master targeting 8.0. Then in order to avoid breaking changes in 7.x, we use a bit of event metadata to influence the later stages of the pipeline's handling of host metadata. See https://github.com/elastic/beats/pull/17919 which would address the pressing need of omitting host when forwarding.

7.x Changes

We should implement the changes detailed above by @urso for 8.0. In an effort to minimize breaking changes to 7.x I am making some changes to address the issue without affecting the default behavior users are currently expecting. This is mostly accomplished through updating the example configurations to demonstrate using tags to disable add_host_metadata and by adding config options where necessary to disable host.name in libbeat.
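The tag-based approach can be pictured as a processor that checks the event's tags before doing anything. This is a sketch under assumptions: the tag name "forwarded" and the function name are illustrative, and the actual example configurations may express the condition differently.

```python
from copy import deepcopy

FORWARDED_TAG = "forwarded"  # illustrative tag name, assumed for this sketch


def add_host_metadata_unless_tagged(event, host_info, skip_tag=FORWARDED_TAG):
    """Add the Beat's host details unless the event carries skip_tag."""
    out = deepcopy(event)
    if skip_tag in out.get("tags", []):
        return out  # forwarded data: leave host.* to the input or module
    out.setdefault("host", {}).update(deepcopy(host_info))
    return out


beat_host = {"name": "collector-01"}
local = add_host_metadata_unless_tagged({"message": "disk full"}, beat_host)
relayed = add_host_metadata_unless_tagged(
    {"message": "link down", "tags": [FORWARDED_TAG]}, beat_host
)
```

Events without the tag keep the default behavior, which is what avoids a breaking change for existing users.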
