Vector: Intermittent panic in CWL sink

Created on 15 Jul 2019 · 9 comments · Source: timberio/vector

During our evaluation of Vector, we have observed that the CWL sink sometimes crashes with the following panic, usually within hours:

Jul 15 17:11:07.108 TRACE source{name="africa"}:file-server: vector::sources::file: Recieved one event. file="/access.log"
Jul 15 17:11:07.596 DEBUG vector::sinks::aws_cloudwatch_logs: Submitting events. count=2
Jul 15 17:11:07.704 ERROR sink{name="cwl_africa"}: vector::sinks::util: request failed: CloudwatchError::Put: connection closed before message completed
Jul 15 17:11:11.209 TRACE source{name="africa"}:file-server: file_source::file_server: Read bytes. path="/access.log" bytes=96
Jul 15 17:11:11.209 TRACE source{name="africa"}:file-server: vector::sources::file: Recieved one event. file="/access.log"
thread 'tokio-runtime-worker-1' panicked at 'The in flight future was dropped!', src/sinks/aws_cloudwatch_logs.rs:340:35
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

From the log, it seems that the sink is dead at this point and will not forward any more events.

If I understand this correctly, the message originates from
https://github.com/hyperium/hyper/blob/5d44f8841c4153174e823acfdde2a1c6d9b66d51/src/proto/h1/conn.rs#L300. I find it somewhat strange that this should hit PutLogEvents with such frequency, since the caller depends on reading the sequence token from the reply; had this been an AWS issue, it would presumably have been found and corrected by now. Thus, I think this is an issue with Vector, Rusoto or Hyper.
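
To make that dependency concrete, here is a minimal blocking sketch against the rusoto_logs 0.37-era API (group and stream names borrowed from the config below; Vector drives these calls as futures rather than blocking):

extern crate rusoto_core;
extern crate rusoto_logs;

use rusoto_core::Region;
use rusoto_logs::{CloudWatchLogs, CloudWatchLogsClient, InputLogEvent, PutLogEventsRequest};

// Each PutLogEvents call must carry the sequence token returned by the
// previous call, so losing a reply mid-flight breaks the chain.
fn put_batches(batches: Vec<Vec<InputLogEvent>>, mut token: Option<String>) {
    let client = CloudWatchLogsClient::new(Region::EuWest1);
    for events in batches {
        let request = PutLogEventsRequest {
            log_group_name: "/ze-group".to_string(),
            log_stream_name: "vector-hostname".to_string(),
            log_events: events,
            sequence_token: token.take(),
        };
        // Blocking call for illustration only.
        let response = client.put_log_events(request).sync().expect("put failed");
        token = response.next_sequence_token;
    }
}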

Running master commit 4a88262 on various EC2 instances with Amazon Linux 2018.03. Configuration:

[sources.input]
  type    = "file"
  include = ["/*.log"]

[sinks.cwl]
  type        = "aws_cloudwatch_logs"
  inputs      = ["input"]
  group_name  = "/ze-group"
  region      = "eu-west-1"
  stream_name = "vector-hostname"

How do I debug this issue (stracing the process seems impractical given its intermittent nature)? I'm grateful for any suggestion of what to try.

Labels: aws_cloudwatch_logs, bug

All 9 comments

@bittrance thanks for reporting this. This seems to stem from AWS abruptly ending the TCP connection. The issue is that we hit this block of code https://github.com/timberio/vector/blob/master/src/sinks/aws_cloudwatch_logs.rs#L360, which causes the tx to be dropped. That in turn notifies the poll_ready for the next request to wake up and error out.
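
The failure mode reduces to a dropped oneshot sender; a minimal futures 0.1 sketch of the mechanism (not Vector's actual code):

extern crate futures;

use futures::sync::oneshot;
use futures::Future;

fn main() {
    // The sink parks a oneshot receiver for the in-flight request.
    let (tx, rx) = oneshot::channel::<()>();
    // Hitting the error path drops the sender before it signals completion...
    drop(tx);
    // ...so the receiver resolves with Canceled, which the sink surfaces
    // as the "in flight future was dropped" panic.
    assert!(rx.wait().is_err());
}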

The best solution I can think of is to retry the put_logs request. I think the best way to achieve this is to just wrap this https://github.com/timberio/vector/blob/master/src/sinks/aws_cloudwatch_logs.rs#L248 call with a loop_fn that retries up to a fixed number of times; beyond that, I do not think there is much that can be done. This should solve the issue with minimal code changes.
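
Something along these lines (a rough futures 0.1 sketch; `send`, `Request`, `Response` and `Error` are hypothetical stand-ins for the rusoto call and its types):

extern crate futures;

use futures::future::{self, Future, Loop};

#[derive(Clone)]
struct Request;
struct Response;
struct Error;

// Hypothetical stand-in for client.put_log_events(...).
fn send(_req: Request) -> Box<dyn Future<Item = Response, Error = Error> + Send> {
    unimplemented!()
}

// Retry `send` up to `max_attempts` times, cloning the request per attempt.
fn send_with_retries(
    req: Request,
    max_attempts: usize,
) -> impl Future<Item = Response, Error = Error> {
    future::loop_fn((req, 0), move |(req, attempt)| {
        send(req.clone()).then(move |result| match result {
            Ok(resp) => Ok(Loop::Break(resp)),
            Err(_) if attempt + 1 < max_attempts => Ok(Loop::Continue((req, attempt + 1))),
            Err(err) => Err(err),
        })
    })
}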

The other option might be to use tower-retry to actually retry the request, but I'm not sure how that would interact with the somewhat complicated state machine that powers the cloudwatch partition service.

As far as I can tell, the log events that were in the truncated request do not appear to have been accepted by CWL. I think I will proceed on the assumption that this is an issue that should be addressed by retries.

> with a loop_fn that retries up to a fixed number of times

I'm trying to make this work, but I'm clearly not as smart as I need to be. This simple method https://gist.github.com/bittrance/8f9fd9e2e61ef34f7f4ba844c4c6dbec works, but when implemented in the CWL sink, it fails to compile with this mysterious message:

error[E0277]: `(dyn futures::future::Future<Error = rusoto_logs::generated::DescribeLogStreamsError, Item = rusoto_logs::generated::DescribeLogStreamsResponse> + std::marker::Send + 'static)` cannot be shared between threads safely
   --> src/sinks/aws_cloudwatch_logs.rs:239:9
    |
239 |         RusotoFuture::from_future(
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^ `(dyn futures::future::Future<Error = rusoto_logs::generated::DescribeLogStreamsError, Item = rusoto_logs::generated::DescribeLogStreamsResponse> + std::marker::Send + 'static)` cannot be shared between threads safely
    |
    = help: the trait `std::marker::Sync` is not implemented for `(dyn futures::future::Future<Error = rusoto_logs::generated::DescribeLogStreamsError, Item = rusoto_logs::generated::DescribeLogStreamsResponse> + std::marker::Send + 'static)`
    = note: required because of the requirements on the impl of `std::marker::Sync` for `std::ptr::Unique<(dyn futures::future::Future<Error = rusoto_logs::generated::DescribeLogStreamsError, Item = rusoto_logs::generated::DescribeLogStreamsResponse> + std::marker::Send + 'static)>`

Note that it is complaining about a different signature than the one in this method. There are similar complaints about all the other futures in the file. I don't get it.
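
For what it's worth, the bound itself can be reproduced in isolation (a tiny demo, unrelated to the sink code):

extern crate futures;

use futures::Future;

fn requires_sync<T: Sync>(_val: T) {}

fn main() {
    let fut: Box<dyn Future<Item = (), Error = ()> + Send> = Box::new(futures::future::ok(()));
    // Uncommenting the next line reproduces the E0277 above: a boxed
    // future that is only Send cannot satisfy a Sync bound.
    // requires_sync(fut);
    drop(fut);
}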

Also, this solution has the annoying property of needing to clone the request. Is there no way to avoid that? A PutLogEventsRequest can be largish.

@bittrance so I think the approach I am taking in #663 is actually much simpler than what I proposed, mostly because it retries the whole stack of requests to ensure there are no state issues between retries.

As for the cloning of the request, I don't think there is any way around it. Futures 0.1 requires owned values to be moved rather than borrowed, which means we are bound by the rusoto type https://docs.rs/rusoto_logs/0.37.0/rusoto_logs/struct.PutLogEventsRequest.html#structfield.log_events and the fact that it takes a Vec, a singly owned type. Ideally something like Arc<Vec<_>> would work better, since to send the events you only need to read them, but we cannot do that with rusoto's current types. So the only way is to clone the request. From what I have seen, though, this clone _only_ happens when a retry is actually performed, so it should be pretty rare.
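
To illustrate the Arc point (hypothetical types; rusoto's actual request takes a plain Vec, so this is not possible today):

use std::sync::Arc;

// A request shaped like this could be retried without copying the events.
struct SharedRequest {
    log_events: Arc<Vec<String>>, // stand-in for Arc<Vec<InputLogEvent>>
    sequence_token: Option<String>,
}

fn clone_for_retry(req: &SharedRequest) -> SharedRequest {
    SharedRequest {
        // Cloning the Arc bumps a refcount; the event buffer is shared.
        log_events: Arc::clone(&req.log_events),
        sequence_token: req.sequence_token.clone(),
    }
}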

@LucioFranco #663 indeed seems more reasonable, though it is worth noting that according to the API docs at https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html the API may reject individual items in the batch, which I think will be tricky to accommodate in this style.
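
Concretely, the response can flag rejected index ranges while still accepting the rest of the batch; a sketch against rusoto_logs 0.37's response type:

extern crate rusoto_logs;

use rusoto_logs::PutLogEventsResponse;

// A whole-request retry cannot distinguish these partial rejections.
fn log_rejections(response: &PutLogEventsResponse) {
    if let Some(info) = response.rejected_log_events_info.as_ref() {
        if let Some(end) = info.too_old_log_event_end_index {
            println!("events up to index {} were too old", end);
        }
        if let Some(start) = info.too_new_log_event_start_index {
            println!("events from index {} onwards were too far in the future", start);
        }
        if let Some(end) = info.expired_log_event_end_index {
            println!("events up to index {} were past retention", end);
        }
    }
}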

I've now upgraded to fix-cw-panic c974895. We'll see if it survives longer. I really appreciate your attention ❤️.

@bittrance no problem. As for those rejected log events, I'm actually not sure what the right approach is there. I don't think there is much we can do, since most of the rejections are due to the timestamp.

I've now been running fix-cw-panic c974895 for 48 hours without incident, so I think we can consider this case addressed. @LucioFranco : #663 should probably mention that it fixes this case.

@LucioFranco I've now followed up on a week's worth of data and it generally looks good. I've had a few "connection closed" issues during this period, which have been correctly retried. Source files and destination log streams are effectively identical across these occurrences. However, there is one case that has occurred at least twice during this period:

Jul 25 04:22:32.887 WARN sink{name="sink__burt_staging-us1_hub_africa_access_log"}: vector::sinks::util::retries: unexpected error type: request timed out
Jul 25 04:22:32.887 ERROR sink{name="sink__burt_staging-us1_hub_africa_access_log"}: log: request failed. error=request timed out  log.target="vector::sinks::util" log.module_path="vector::sinks::util" log.file="src/sinks/util/mod.rs" log.line=200
Jul 25 04:22:32.887 ERROR sink{name="sink__burt_staging-us1_hub_africa_access_log"}: vector::sinks::util: request failed. error=request timed out

I think this is the Tower timeout module https://github.com/tower-rs/tower/blob/716bafd9222e491e9e4e9b394aa9dbb13825b6f5/tower-timeout/src/error.rs#L13 throwing in the towel; this error should also be retried.
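
If a retry policy is to catch this, it presumably needs to recognise the concrete error type behind the boxed error. Roughly (a sketch, assuming the Elapsed type from the tower-timeout error module linked above):

extern crate tower_timeout;

use std::error::Error;

// The timeout error arrives as a boxed trait object, so recognising it
// means checking for the concrete Elapsed type.
fn is_timeout(err: &(dyn Error + 'static)) -> bool {
    err.is::<tower_timeout::error::Elapsed>()
}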

@bittrance awesome, thank you for reporting back on this, it has been super helpful!

This PR https://github.com/timberio/vector/pull/691 should fix the timeout issue!

@LucioFranco Excellent. I've deployed a build based on dabbedd but it may well be several days before the issue occurs, so you should probably merge #691 when you feel it is ready and let it close this issue.
