We should add the ability for a user to enrich their log events from some form of external metadata.
Provide an HTTP transform that can hit some HTTP endpoint on a set interval and enrich log events with fields from that endpoint's response. This is very similar to the `aws_ec2_metadata` transform except it's more general. I suggest we start with just JSON decoding and use the same `evmap` approach.
We should provide this set of options to users:
- `endpoint` — the endpoint to hit.
- `fields` — a list of fields to include from the map of JSON returned.
- `refresh_interval_secs` — the interval in seconds at which we will refresh with new data.
- `clear_old` — when set to `true`, incoming new data will clear the map first.

If there's a possibility in the future of introducing a transform that hits an HTTP service for each event (posting its contents) then we ought to consider giving this specific transform a name that distinguishes it.
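To make the proposal concrete, here's a hypothetical config sketch using those options. The transform name `http_enrich`, the input name, and the endpoint/field values are all illustrative placeholders, not a settled design:

```toml
# Hypothetical sketch only -- the transform name, endpoint, and
# field paths are placeholders for illustration.
[transforms.enrich]
  type = "http_enrich"              # placeholder name
  inputs = ["my_logs"]
  endpoint = "http://127.0.0.1:8500/v1/agent/self"
  fields = ["Config.NodeName", "Config.Datacenter"]
  refresh_interval_secs = 30        # re-fetch every 30 seconds
  clear_old = true                  # wipe the map before inserting fresh data
```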
@Jeffail the goal of this specific transform is only to enrich events from some dynamic external resource. We don't guarantee that each event gets the data at the same time the external resource produces it.
I think you are explaining the http sink?
Can you provide an actual use case for this? I'm curious why someone would use this over a simple add_fields or lua transform?
Yeah I understand the idea behind this transform, I'm just speculating that maybe in the future we might also want a request/reply pattern type transform that hits a service with the contents of an event and enriches it with the response.
If we wanted to support more data science type use cases then it would allow users to hit enrichment services like language detection or sentiment analysis with the message contents of the event before sending it to their index.
If we were to do that then we'd need to find a new name.
@Jeffail I _personally_ think that type of use case is way out of scope for vector, since it would basically mean we can only process events as fast as the external resource, which to me doesn't really seem like something we would even want to support.
@binarylogic I don't know of anyone directly requesting this, but I can see it being used to fetch metadata for events from things like consul that run an agent per machine. This is different from `add_fields` because it's dynamic and can change over time via new data coming from the HTTP request. It's different from `lua` because it would be much, much faster. Basically, this http transform and `aws_ec2_metadata` are the lowest-overhead transforms that can enrich dynamically, since they remove the network IO from the critical path.
As a thought, could it be a source rather than a transform? Then the fields from the response could be joined with the actual events which are intended to be enriched, using something like this: https://github.com/timberio/vector/issues/1200. This would allow performing transformations on the responses, for example using `regex_parser` to extract data from responses that are not in a standard format like JSON.
@a-rodin I think we could in theory provide a separate source that is basically a pull-based HTTP client. I'm not sure exactly what the use case for that would be, but it would be similar to what the prometheus source does.
I think this transform specifically is to enable stream-table joins for users that may be on systems we don't support.
@LucioFranco I like this use case, I've been thinking about it for a long time (see the "currency rate conversion" example from https://github.com/timberio/vector/issues/1041). And having it implemented as a native transform, as opposed to using it from scripts, can greatly improve performance.
A few implementation questions:
Could it make sense to use `Expires`/`Cache-Control` HTTP headers to automatically derive `refresh_interval_secs` if it is not provided?
Yeah, this is totally something we should support, but I would hope that we could use them together: you have a fixed interval, and that header can also drive another request.
Would it be possible to configure the behavior in case the HTTP request fails? For example, in some cases the events need to be completely dropped if the HTTP request in the transform failed, and in other cases the new fields might simply not be added, with the events still passed on to the sinks.
This is def something we could make configurable, but it should also be possible with a mix of `clear_old` and a lua script that drops the event if the fields don't exist.
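As a sketch of that workaround (the `http_enrich` transform name is a placeholder, and the lua snippet assumes an API where assigning `nil` to `event` drops it):

```toml
# Hypothetical sketch: require the enriched field, otherwise drop the event.
[transforms.enrich]
  type = "http_enrich"            # placeholder name
  inputs = ["my_logs"]
  endpoint = "http://127.0.0.1:8080/metadata"
  fields = ["app_id"]
  refresh_interval_secs = 30
  clear_old = true                # stale entries vanish, so the check below works

[transforms.require_app_id]
  type = "lua"
  inputs = ["enrich"]
  source = """
  if event["app_id"] == nil then
    event = nil  -- drop events that were not enriched
  end
  """
```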
I think there might be cases where the `Expires` headers are not always correct; for example, a server could always return 0, which means the resource is already expired. However, this probably can be solved by adding an additional option. For example, there could be a boolean option `http_cache_control` which would enable/disable usage of the HTTP headers.
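Something like this, perhaps (both option names are hypothetical at this point):

```toml
# Hypothetical: a fixed fallback interval, with the HTTP caching headers
# optionally allowed to trigger earlier refreshes when present.
refresh_interval_secs = 60
http_cache_control = true   # honor Cache-Control/Expires from the response
```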
I want to add this to https://github.com/timberio/vector/milestone/23, as I think it fits nicely to the vectorized approach outlined in https://github.com/timberio/vector/issues/1041#issuecomment-567173838.
We have a use case for this!
We're running a Nomad cluster and want to forward logs to our users. Nomad outputs logs to files, but the path only gives us two pieces of information: the allocation id and the task name. We have an "app id" in each task's metadata that we need to tag log events with. The information is available via an HTTP request plus some JSON parsing. The HTTP client should accept client certificates for authentication.
We're looking at using vector at fly.io for much more than this, but this specific use case is a way to test vector.
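For reference, that use case might look something like this sketch. The transform name, endpoint, and field path are assumptions for illustration; the TLS options are loosely modeled on the `tls.*` options Vector already exposes elsewhere:

```toml
# Hypothetical sketch for the Nomad use case above.
[transforms.nomad_meta]
  type = "http_enrich"            # placeholder name
  inputs = ["nomad_logs"]
  endpoint = "https://nomad.example.internal:4646/v1/allocations"
  fields = ["Job.Meta.app_id"]
  refresh_interval_secs = 15
  # client certificate authentication, as requested above
  tls.crt_file = "/etc/vector/client.crt"
  tls.key_file = "/etc/vector/client.key"
```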