Vector: Robust handling of non-retriable errors

Created on 16 Jul 2019 · 4 Comments · Source: timberio/vector

When a non-retriable error happens in a sink, Vector reports it in the logs and stops sending data to that sink.

For example, I have this error for the s3 sink in the logs:

vector::sinks::util::retries: encountered non-retriable error: error writing a body to connection: Connection reset by peer (os error 104)
vector::sinks::util: request failed: error writing a body to connection: Connection reset by peer (os error 104)

and no data was sent to S3 after that.

Here the error is clearly retriable, since a new connection to the S3 API endpoint could have been opened.

I see two possibilities to deal with this:

  • Make more errors retriable. For the s3 sink, HttpDispatch and maybe also ParseError and Unknown errors could be considered retriable.
  • Add an option to quit when a non-retriable error occurs. In this case Vector can be restarted automatically by the system that started it, for example, k8s or ECS.
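
The two options above could be combined in a single classification step. The following is only a minimal sketch under stated assumptions: `SinkError`, `RetryAction`, `classify`, and the `exit_on_fatal` flag are hypothetical names made up for illustration, not Vector's actual types or configuration. Which error kinds and status codes count as retriable is also an assumption.

```rust
use std::io::ErrorKind;

/// Hypothetical error type for a failed sink request (not Vector's real type).
#[derive(Debug)]
enum SinkError {
    /// Transport-level failure, e.g. "Connection reset by peer".
    Dispatch(ErrorKind),
    /// The service responded with this HTTP status code.
    Status(u16),
    /// The response could not be parsed.
    Parse(String),
}

/// What the sink should do after a failed request.
#[derive(Debug, PartialEq)]
enum RetryAction {
    /// Send the same request again (ideally over a fresh connection).
    Retry,
    /// Drop this request and carry on with the next batch.
    DontRetry,
    /// Stop the process so a supervisor (k8s, ECS) can restart it.
    Exit,
}

/// Decide how to react to an error.
/// `exit_on_fatal` stands in for the proposed "quit on non-retriable error" option.
fn classify(error: &SinkError, exit_on_fatal: bool) -> RetryAction {
    let retriable = match error {
        // Assumption: connection-level failures are transient.
        SinkError::Dispatch(kind) => matches!(
            kind,
            ErrorKind::ConnectionReset
                | ErrorKind::ConnectionAborted
                | ErrorKind::BrokenPipe
                | ErrorKind::TimedOut
        ),
        // Assumption: throttling and server-side errors are transient.
        SinkError::Status(code) => *code == 429 || (500..600).contains(code),
        // Assumption: an unparsable response is treated as permanent here.
        SinkError::Parse(_) => false,
    };

    if retriable {
        RetryAction::Retry
    } else if exit_on_fatal {
        RetryAction::Exit
    } else {
        RetryAction::DontRetry
    }
}
```

With this sketch, the connection reset from the log above would map to `RetryAction::Retry`, while something like a 403 response would map to `DontRetry` or, with the hypothetical option enabled, to `Exit`.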
tests aws_s3 bug

All 4 comments

Thanks for reporting this @a-rodin! You're totally right, we should be handling these connection resets much better. I think your suggestion of making these errors retriable is the right path and we'll dig into it.

no data was sent to S3 after that

I'm having trouble reproducing this part of the issue. In my tests, after a non-retriable error, the sink will continue to make subsequent requests (i.e. with the next batch of data that comes in) normally.

Did you observe behavior that contradicts that, @a-rodin? Or is it possible that after the original failed request there was no additional data to be sent?

With #651 merged, I'm going to close this and shift the broader errors conversation to #655.

If we can confirm and reproduce the case of non-retriable errors preventing further requests from being made, we can reopen and investigate further.

I'm having trouble reproducing this part of the issue. In my tests, after a non-retriable error, the sink will continue to make subsequent requests (i.e. with the next batch of data that comes in) normally.

It turned out there is another problem, happening together with this one: when I use kafkacat with the stdin source (see #582 for a description of the setup), kafkacat spontaneously stops writing data to stdout. However, it might be a problem with kafkacat and not Vector, so it probably should not be addressed here.
