Vector: [aws_s3] Vector is unable to start if the EC2 instance does not have read/write access to the S3 bucket.

Created on 18 May 2020 · 3 comments · Source: timberio/vector

We use a healthcheck for the S3 sink, but Vector is still unable to start: when it gets a 403 from the S3 bucket API, it stops.
Output:

May 15 10:29:14.240 ERROR sink{name=s3_archives type=aws_s3}:request{request_id=1}: vector::sinks::util::retries: encountered non-retriable error. error=Sts AssumeRoleError: Unknown(BufferedHttpResponse {status: 403, body: "<ErrorResponse xmlns=\"https://sts.amazonaws.com/doc/2011-06-15/\">\n  <Error>\n    <Type>Sender</Type>\n    <Code>AccessDenied</Code>\n    <Message>User: arn:aws:sts::xxxx is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::xxxx</Message>\n  </Error>\n  <RequestId>xxxx</RequestId>\n</ErrorResponse>\n", headers: {"x-amzn-requestid": "xxxx", "content-type": "text/xml", "content-length": "425", "date": "Fri, 15 May 2020 10:29:13 GMT"} })
May 15 10:29:14.240 ERROR sink{name=s3_archives type=aws_s3}: vector::sinks::util::sink: Request failed. error=Sts AssumeRoleError: Unknown(BufferedHttpResponse {status: 403, body: "<ErrorResponse xmlns=\"https://sts.amazonaws.com/doc/2011-06-15/\">\n  <Error>\n    <Type>Sender</Type>\n    <Code>AccessDenied</Code>\n    <Message>User: arn:aws:sts::xxxx is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::xxxx</Message>\n  </Error>\n  <RequestId>xxxx</RequestId>\n</ErrorResponse>\n", headers: {"x-amzn-requestid": "xxxx", "content-type": "text/xml", "content-length": "425", "date": "Fri, 15 May 2020 10:29:13 GMT"} })
May 15 10:29:14.240 ERROR sink{name=s3_archives type=aws_s3}: vector::sinks::aws_s3: Sink failed to flush: Sts AssumeRoleError: Unknown(BufferedHttpResponse {status: 403, body: "<ErrorResponse xmlns=\"https://sts.amazonaws.com/doc/2011-06-15/\">\n  <Error>\n    <Type>Sender</Type>\n    <Code>AccessDenied</Code>\n    <Message>User: arn:aws:sts::xxxx is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::xxxx</Message>\n  </Error>\n  <RequestId>-xxxx</RequestId>\n</ErrorResponse>\n", headers: {"x-amzn-requestid": "xxxx", "content-type": "text/xml", "content-length": "425", "date": "Fri, 15 May 2020 10:29:13 GMT"} })
May 15 10:29:14.244 ERROR vector::topology: Unhandled error

Our configuration:

[sinks.s3_archives_auth]
  inputs       = ["file_auth"] # don't sample
  type         = "aws_s3"
  region       = "us-east-1"
  bucket       = "BUCKET_NAME"
  encoding     = "text"
  compression  = "gzip"
  healthcheck  = true # optional, default
  # Batch
  batch.max_size     = 100000000 # uncompressed bytes
  batch.timeout_secs = 900 # optional, default, seconds
  # Buffer
  buffer.type      = "memory" # optional, default
  buffer.max_events = 500
  #buffer.max_size  = 253741824 # required, bytes, required when type = "disk"
  buffer.when_full = "block" # optional, default
  # Encryption
  server_side_encryption = "AES256" # optional, no default
  assume_role = "IAM_ROLE_NAME"
  # Metadata not more than 10 keys
  tags.source = "system-auth"
  tags.hostname = "${VECTOR_HOSTNAME}"
  tags.role = "${VECTOR_ROLE}"
  tags.cluster = "${VECTOR_CLUSTER}"
  # Naming
  filename_append_uuid = true # optional, default
  filename_extension = "log" # optional, default
  filename_time_format = "%s" # optional, default
  key_prefix = "system/auth/%Y/%m/%d/${VECTOR_HOSTNAME}_" # optional, default
  # Request
  request.in_flight_limit = 50 # optional, default, requests
  request.rate_limit_duration_secs = 1 # optional, default, seconds
  request.rate_limit_num = 250 # optional, default
  request.retry_attempts = 300 # optional, default

Based on the docs (https://vector.dev/docs/reference/sinks/aws_s3/#health-checks), Vector should log an alert and still start.

Labels: aws_s3, bug, help


All 3 comments

@kaarolch it looks like Vector does not have permission to assume roles. You have this option set:

assume_role = "IAM_ROLE_NAME"

Is that intentional? If you meant for this to be an env var you should set it like so:

assume_role = "${IAM_ROLE_NAME}"

Let me know if that helps.

@binarylogic I forgot to add that the assume_role and bucket values were anonymized.
Sometimes we change configuration inside our infrastructure, and IMHO when S3 returns a 403 (e.g. we use a wrong role or a wrong ACL on the S3 bucket), Vector should continue to run and simply stop shipping logs to that bucket, provided healthcheck is set to true. In our case Vector stops with ERROR vector::topology: Unhandled error.
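For context, the anonymized fields would normally look something like the sketch below (the account ID, role name, and bucket name are placeholders, not real values). Note that assume_role takes the full role ARN, not just the role name:

[sinks.s3_archives_auth]
  inputs      = ["file_auth"]
  type        = "aws_s3"
  region      = "us-east-1"
  bucket      = "my-archive-bucket"                          # placeholder
  assume_role = "arn:aws:iam::123456789012:role/vector-s3"   # placeholder ARN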

You're right, Vector should not be shutting down in this case.

The issue seems to be that we propagate the failed request error here:

https://github.com/timberio/vector/blob/5d993333eb67bc05996cf6d77e94296a27bc537e/src/sinks/util/sink.rs#L679-L685

Which is then returned as a top-level sink error from a number of possible places in PartitionedBatchSink::poll_complete. It seems like this was accidentally introduced in #2111.

We need to make it clearer what our error propagation boundaries are for cases like this and I'll look to address that in #2625.

For now, I think the simplest fix is likely to change that return to simply log the error and continue.
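To illustrate the shape of that fix, here is a simplified, self-contained sketch. This is not Vector's actual code; the names `flush_batch` and `poll_complete` are stand-ins for the real sink internals. The point is the error boundary: a non-retriable per-request failure is logged and the batch dropped, rather than being returned as a fatal sink error that tears down the topology.

```rust
// Simplified sketch of "log and continue" instead of propagating a
// failed-request error out of the sink's poll loop.

#[derive(Debug)]
enum RequestError {
    AccessDenied(String),
}

// Hypothetical flush step: one result per request in the batch.
fn flush_batch(batch: &[&str]) -> Vec<Result<(), RequestError>> {
    batch
        .iter()
        .map(|req| {
            if req.contains("forbidden") {
                Err(RequestError::AccessDenied(format!("403 for {}", req)))
            } else {
                Ok(())
            }
        })
        .collect()
}

// Before the fix (conceptually), the `Err` would bubble up and be treated
// as a fatal sink error. After: log the error, drop the batch, keep going.
fn poll_complete(batch: &[&str]) -> Result<usize, RequestError> {
    let mut delivered = 0;
    for result in flush_batch(batch) {
        match result {
            Ok(()) => delivered += 1,
            // Log the non-retriable error instead of returning it.
            Err(e) => eprintln!("Request failed, dropping batch: {:?}", e),
        }
    }
    Ok(delivered) // the sink itself stays healthy
}

fn main() {
    let delivered = poll_complete(&["a", "forbidden-b", "c"]).unwrap();
    println!("delivered {}", delivered);
}
```

The trade-off is visibility: the error no longer stops the process, so it must be surfaced loudly in logs (and ideally metrics) or silent data loss becomes easy to miss.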

