This issue represents the introduction of a new source that can be used as the HTTP endpoint for an AWS Kinesis Firehose destination.
This source will specifically be designed to read forwarded AWS CloudWatch Logs (subscription format) first, but could be expanded to other Firehose sources in the future.
I recommend a spike into this to discover and answer open questions and, possibly, an RFC if there appear to be more decisions to be made.
Other work that will probably need to happen as part of this:
This will be pretty relevant now! Check this out: https://aws.amazon.com/about-aws/whats-new/2020/08/cloudfront-realtimelogs/
You can also easily deliver these logs to a generic HTTP endpoint using Amazon Kinesis Data Firehose.
I'd love to use vector to ingest such data...
🤩
Jotting some notes here as I explore this.
I set up the pipeline to publish CloudWatch Logs -> Firehose -> HTTP endpoint (largely following https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html).
Here is an example request for forwarded CloudWatch Logs (including headers): https://gist.github.com/jszwedko/46bf9338c737cd3b1ac5f5cd39c48daa
The `data` part of each record is base64-encoded, gzipped data. Decoded, the first record in that request looks like this (trimmed):
```json
{
  "messageType": "DATA_MESSAGE",
  "owner": "071959437513",
  "logGroup": "/jesse/test",
  "logStream": "test",
  "subscriptionFilters": [
    "Destination"
  ],
  "logEvents": [
    {
      "id": "35683658089614582423604394983260738922885519999578275840",
      "timestamp": 1600110569039,
      "message": "{\"bytes\":26780,\"datetime\":\"14/Sep/2020:11:45:41 -0400\",\"host\":\"157.130.216.193\",\"method\":\"PUT\",\"protocol\":\"HTTP/1.0\",\"referer\":\"https://www.principalcross-platform.io/markets/ubiquitous\",\"request\":\"/expedite/convergence\",\"source_type\":\"stdin\",\"status\":301,\"user-identifier\":\"-\"}"
    },
    {
      "id": "35683658089659183914001456229543810359430816722590236673",
      "timestamp": 1600110569041,
      "message": "{\"bytes\":17707,\"datetime\":\"14/Sep/2020:11:45:41 -0400\",\"host\":\"109.81.244.252\",\"method\":\"GET\",\"protocol\":\"HTTP/2.0\",\"referer\":\"http://www.investormission-critical.io/24/7/vortals\",\"request\":\"/scale/functionalities/optimize\",\"source_type\":\"stdin\",\"status\":502,\"user-identifier\":\"feeney1708\"}"
    },
    ...
```
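For reference, the decoding described above (base64, then gzip, then JSON) can be sketched in Python using only the standard library; the record shape is assumed from the gist above:

```python
import base64
import gzip
import json


def decode_firehose_record(data: str) -> dict:
    """Decode the base64-encoded, gzip-compressed `data` field of a
    Firehose HTTP request record into its CloudWatch Logs payload."""
    compressed = base64.b64decode(data)
    return json.loads(gzip.decompress(compressed))
```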
Curiously, there doesn't appear to be any sort of enum in the records or headers indicating that the incoming events are from CloudWatch Logs. It seems like we'll simply need to rely on schema matching (perhaps by just looking for a `logEvents` key).
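A schema sniff along those lines might look like the following sketch (the `logEvents` heuristic is exactly the one suggested above; the function name is illustrative):

```python
def looks_like_cloudwatch_logs(payload: dict) -> bool:
    """Heuristic check: CloudWatch Logs subscription payloads carry a
    `logEvents` array, since no explicit type enum is present."""
    return isinstance(payload.get("logEvents"), list)
```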
Firehose also pings with control messages on a regular basis. The content of the records looks like:
```json
{"messageType":"CONTROL_MESSAGE","owner":"CloudwatchLogs","logGroup":"","logStream":"","subscriptionFilters":[],"logEvents":[{"id":"","timestamp":1600110003794,"message":"CWL CONTROL MESSAGE: Checking health of destination Firehose."}]}
```
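A consumer will presumably want to acknowledge these health checks without emitting them as events. A minimal filter, assuming the `messageType` field shown in the payloads above:

```python
def extract_log_events(payload: dict) -> list:
    """Return the log event messages from a DATA_MESSAGE payload;
    CONTROL_MESSAGE health checks yield no events."""
    if payload.get("messageType") == "CONTROL_MESSAGE":
        return []
    return [event["message"] for event in payload.get("logEvents", [])]
```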
@jszwedko have you tested the CloudFront integration? Curious to see what you think...
> Curiously there doesn't appear to be any sort of enum in the records or headers to determine that the incoming events are from CloudWatch Logs.
Yeah, that's odd. I'm not 100% up to speed on how Vector is internally architected, but I guess you'd need it to skip invalid entries for the input model?
I haven't tried the CloudFront integration just yet. This initial pass will be focused on CloudWatch Logs, but it should be easy to add support for additional services like CloudFront once the pattern is set.
> Yeah, that's odd. I'm not 100% up to speed on how Vector is internally architected, but I guess you'd need it to skip invalid entries for the input model?
Yeah, my realization was that Kinesis Firehose is similar to Kinesis in that it is just passing arbitrary bytes around. It's up to the consumer and producer to coordinate what those bytes mean. In this case, we'll need to rely on the user configuring Vector to indicate that messages being passed via Firehose are from an AWS CloudWatch Logs subscription so that we can parse them as such.
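To make that concrete, the user-facing configuration might look something like the following sketch. All names here (the source type, the `record_format` option, and its value) are hypothetical, not a settled design:

```toml
[sources.firehose_in]
  # Hypothetical source type and option names, illustrative only
  type = "aws_kinesis_firehose"
  address = "0.0.0.0:443"
  # The user declares what the Firehose records contain, since the
  # payload carries no type enum; here, a CloudWatch Logs subscription.
  record_format = "cloudwatch_logs"
```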