AWS released a feature today: converting JSON from a Kinesis Firehose stream to Apache Parquet or Apache ORC before saving to S3.
Previously, you needed to write and pay for AWS Glue ETL jobs to do that.
Documentation: https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
The relevant Stack Overflow question has ~2500 views, which suggests this was a long-awaited feature.
Suggested syntax, based on the API:
resource "aws_kinesis_firehose_delivery_stream" "test_stream" {
  name        = "terraform-kinesis-firehose-test-stream"
  destination = "s3"

  data_format_conversion {
    enabled = "true"

    input_format_configuration {
      deserializer = "Apache Hive JSON" # or OpenX JSON
    }

    output_format_configuration {
      serializer = "ORC" # or Parquet
    }

    schema_configuration {
      catalog_id    = "${aws_glue_catalog_database.main.catalog_id}"
      database_name = "${aws_glue_catalog_database.main.name}"
      table_name    = "${aws_glue_catalog_table.main.name}"
      role_arn      = "..."
      version_id    = "3" # or LATEST by default
    }
  }
}
Prerequisite: AWS Go SDK v1.13.47 (#4512)
I'm +1 on the proposed configuration syntax here as an MVP. However, the input_format_configuration and output_format_configuration sections expose many more knobs:
input:
https://docs.aws.amazon.com/firehose/latest/APIReference/API_HiveJsonSerDe.html
https://docs.aws.amazon.com/firehose/latest/APIReference/API_OpenXJsonSerDe.html
output:
https://docs.aws.amazon.com/firehose/latest/APIReference/API_ParquetSerDe.html
https://docs.aws.amazon.com/firehose/latest/APIReference/API_OrcSerDe.html
It appears that most of these are optional, but they will still be returned by DescribeDeliveryStream: https://docs.aws.amazon.com/firehose/latest/APIReference/API_DescribeDeliveryStream.html
I will try to get a pull request submitted for this tomorrow or Wednesday.
Ack, I only got about halfway through implementing the 36(!) new attributes required in the full schemas for serializers/deserializers before running out of time ahead of a short vacation. I'll be able to pick this back up on Tuesday unless someone wants to get something in sooner.
Pull request submitted with all underlying options: #4842
Support has been merged into master and will be released in version 1.24.0 of the AWS provider, likely in the middle of this week. 🎉
This has been released in version 1.24.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.
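For anyone landing here later: the syntax that shipped differs from the proposal at the top of this issue. Below is a minimal sketch of how I understand the released form, assuming the conversion block lives under extended_s3_configuration as data_format_conversion_configuration, with the deserializer and serializer expressed as nested blocks. The aws_iam_role.firehose and aws_s3_bucket.bucket references are hypothetical placeholders, not resources from this thread; check the provider documentation for the full argument list before relying on it.

```hcl
resource "aws_kinesis_firehose_delivery_stream" "test_stream" {
  name        = "terraform-kinesis-firehose-test-stream"
  destination = "extended_s3"

  extended_s3_configuration {
    # Hypothetical role and bucket, shown only to make the sketch complete.
    role_arn   = "${aws_iam_role.firehose.arn}"
    bucket_arn = "${aws_s3_bucket.bucket.arn}"

    data_format_conversion_configuration {
      input_format_configuration {
        deserializer {
          # Empty block: use the Hive JSON SerDe with its defaults.
          # open_x_json_ser_de {} is the alternative.
          hive_json_ser_de {}
        }
      }

      output_format_configuration {
        serializer {
          # Empty block: write ORC with default settings.
          # parquet_ser_de {} is the alternative.
          orc_ser_de {}
        }
      }

      schema_configuration {
        database_name = "${aws_glue_catalog_database.main.name}"
        table_name    = "${aws_glue_catalog_table.main.name}"
        role_arn      = "${aws_iam_role.firehose.arn}"
      }
    }
  }
}
```

Leaving the SerDe blocks empty keeps the service defaults; the optional attributes from the API pages linked earlier in this thread go inside those blocks.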