Vector: TCP source: "Socket not connected"

Created on 3 Jun 2020 · 15 comments · Source: timberio/vector

When using the socket source with type TCP and creating short-lived connections, vector quickly crashes with the following error:

source{name=in type=socket}: vector::sources::util::tcp: failed to accept socket error=Could not get peer address: Socket not connected (os error 107)

I narrowed the configuration down to the following:

[sources.in]
  address = "0.0.0.0:5514"
  mode = "tcp"
  type = "socket"

[sinks.blackhole]
  type = "blackhole"
  inputs = ["in"]
  print_amount = 1000

I run it as a docker image like so:
docker run -it --network host -v $(pwd):/etc/vector/config timberio/vector:nightly-alpine --config /etc/vector/config/config.toml

And then I send something via netcat, closing the connection directly afterwards. Sometimes it takes a couple of attempts, but it does consistently crash at some point. You should be able to reproduce it with something like:
while true; do echo "hi" | nc -c -v localhost 5514; done

When I connect to it without the -c flag (i.e. keeping the connection open), everything is fine. The error appears to only happen when the connection is closed directly after sending.
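If nc isn't available, here is a rough Rust sketch of the same reproduction pattern (connect, write one line, close immediately); the address matches the config above, and the loop is just an assumption of what `nc -c` in a shell loop does:

```rust
// Repro sketch: hammer the socket source with short-lived connections,
// closing each one immediately after the write, like `nc -c` does.
use std::io::Write;
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    loop {
        let mut stream = TcpStream::connect("127.0.0.1:5514")?;
        stream.write_all(b"hi\n")?;
        // Dropping `stream` at the end of the iteration closes the
        // connection right after sending, which is what triggers the error.
    }
}
```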

Log level trace doesn't reveal any more information, btw.

Labels: networking, socket, bug

All 15 comments

I think I know what the problem is, but we are having difficulty reproducing the problem here. I have tried with current master branch, version 0.9.2, and the nightly docker image, and ran the nc loop for several minutes each, with at least 100,000 connections, without observing any issues.

Could you tell us a little more about your environment? What kind of hardware and software environment is this running in?

Hi @bruceg,

thanks for looking into it!

The problem occurs both when running the vector image on my machine and in our k8s cluster. I also just tried building from source and got a very similar error message, though slightly different:

ERROR source{name=in type=socket}: vector::sources::util::tcp: failed to accept socket error=Could not get peer address: Transport endpoint is not connected (os error 107)

So I don't think it is really related to where vector runs. However, maybe this is a very specific problem with nc? I'm running GNU netcat 0.7.1:

$ nc -h
GNU netcat 0.7.1, a rewrite of the famous networking tool.

on arch:

 OS: Arch Linux x86_64
 Kernel Release: 5.6.15-arch1-1
 RAM: 2455 MB / 31904 MB
 Processor Type: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Is this enough info? Happy to provide more if needed!

edit: Here is a screen capture of the problem occurring:
(screenshot of the terminal showing the error)

Hmm, still can't reproduce. I've tried GNU netcat, OpenBSD netcat, and tcpclient from ucspi-tcp, with Vector in both debug and release mode. @Hoverbear also tried. I do believe this is a real problem that we should resolve; it's just frustrating that we can't produce a test case that demonstrates the problem, so we'd know we've fixed it. Since you indicate you can build from source, I will write a patch for what I think is the problem and see if that resolves the issue for you.

Please try the patch in #2750 and let us know if that resolves the issue for you.

Hi @bruceg ,

thanks again for investing the time, even more so on such a frustrating issue :)

I tested your patch and unfortunately, it does not fix the issue. But you were so close!
I started the debugger and looked at the place where you added the fix. The error I get is actually not a TlsError::Connect, but a TlsError::PeerAddress. The underlying os error is indeed io::ErrorKind::NotConnected though.

I modified the error handling there in the following way:

.map_err(|error| match &error {
    TlsError::Connect { source } => match source.kind() {
        io::ErrorKind::NotConnected => (),
        _ => error!(message = "failed to accept socket", %error),
    },
    TlsError::PeerAddress { source } => match source.kind() {
        io::ErrorKind::NotConnected => (),
        _ => error!(message = "failed to accept socket", %error),
    },
    _ => error!(message = "failed to accept socket", %error),
})

This does indeed catch the error, but something is still going wrong further up the call stack:

ERROR vector::topology: Unhandled error

So I guess the error is still propagated up somewhere? This is my first time doing anything with Rust, so I'm a bit lost as to how this works, to be honest. Otherwise, I'd be happy to send a patch with the fix.

_One other thought:_ does it maybe make sense to always ignore it if the underlying error is io::ErrorKind::NotConnected, no matter which TlsError variant wraps it? Again, not sure how that would be done in Rust, but it seems sensible to me :)
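As a rough illustration of that thought: the sketch below mirrors only the two variants matched in the snippet above; the placeholder enum and the helper name `is_harmless_disconnect` are made up for illustration and are not Vector's actual types.

```rust
use std::io;

// Placeholder mirroring only the variants exercised in the snippet above;
// the real TlsError in Vector has more variants and other source types.
enum TlsError {
    Connect { source: io::Error },
    PeerAddress { source: io::Error },
    Other,
}

// Hypothetical helper: treat any accept error whose underlying io::Error is
// NotConnected as a harmless "client went away" event, regardless of which
// TlsError variant wrapped it.
fn is_harmless_disconnect(error: &TlsError) -> bool {
    match error {
        TlsError::Connect { source } | TlsError::PeerAddress { source } => {
            source.kind() == io::ErrorKind::NotConnected
        }
        _ => false,
    }
}
```

With something like this, the map_err above would collapse to a single `if is_harmless_disconnect(&error) { } else { error!(...) }`, though whether that is the right place to handle it is a separate question given the shutdown behaviour discussed below.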

When using the socket source with type TCP and creating short-lived connections, vector quickly crashes with the following error

Could you clarify something? When you say that vector crashes, does it simply emit the error message and continue, or does it stop accepting connections (either because it exits or for some other reason)?

It prints the error message and shuts down.

Ok, that makes it a completely different bug than I was thinking. The more important error message is caught just briefly in your terminal image, and that is the ERROR vector::topology: Unhandled error. I will dig into that and probably close the first patch, as it does not seem to have anything directly to do with the crash issue.

Oh sorry, I realize now that I only made the connection between the peer address issue and the unhandled error while writing this issue, and possibly they are not actually related in that way.

Here is the full log when encountering the problem:
```

docker run -it --network host -v $(pwd):/etc/vector/config timberio/vector:nightly-alpine --config /etc/vector/config/config.toml -vvv

Jun 11 09:31:03.664 INFO vector: Log level "trace" is enabled.
Jun 11 09:31:03.664 INFO vector: Loading configs. path=["/etc/vector/config/config.toml"]
Jun 11 09:31:03.665 TRACE vector: Parsing config. path="/etc/vector/config/config.toml"
Jun 11 09:31:03.668 INFO vector: Vector is starting. version="0.10.0" git_version="v0.9.0-186-gd5e988a" released="Thu, 04 Jun 2020 14:40:54 +0000" arch="x86_64"
Jun 11 09:31:03.670 INFO vector::topology: Running healthchecks.
Jun 11 09:31:03.670 INFO vector::topology: Starting source "in"
Jun 11 09:31:03.670 INFO vector::topology: Starting sink "blackhole"
Jun 11 09:31:03.670 INFO vector::topology::builder: Healthcheck: Passed.
Jun 11 09:31:03.670 INFO source{name=in type=socket}: vector::sources::util::tcp: listening. addr=0.0.0.0:5514
Jun 11 09:31:13.805 ERROR source{name=in type=socket}: vector::sources::util::tcp: failed to accept socket error=Could not get peer address: Socket not connected (os error 107)
Jun 11 09:31:13.805 INFO vector::shutdown: All sources have finished.
Jun 11 09:31:13.805 INFO vector: Shutting down.
Jun 11 09:31:13.805 ERROR vector::topology: Unhandled error
```

By the way: I'm happy to jump on a call if need be and you'd like to see/debug this hands-on. (We might have a bit of a timezone issue but we could probably figure something out) :)

@ktff Hey! :) Do you think you could connect with Bruce and puzzle this one out with him?

I think the errors from the socket source are justified. They are caused by the connection being closed quickly after connecting. At that point the connection could be in various states of processing. One of those states is when we are trying to fetch peer_addr but the connection has already been closed, hence the error

ERROR source{name=in type=socket}: vector::sources::util::tcp: failed to accept socket error=Could not get peer address: Transport endpoint is not connected (os error 107)

The bug here is that those are not catastrophic failures. We can certainly continue running, listening for other connections, and even process the messages we got from the connection in question. The fix should catch this error and similar ones at their call sites and deal with them there. For example, for the error above I think it originates here, but it would be good to comb through the call stack and find other such places.

#2750 was on the right trail here; it's just that, at that point, in futures 0.1 an error closes the listening stream, which closes the source, which for this example means all sources have been closed, so vector naturally shuts down. That's why those errors should be caught earlier.
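To illustrate the shape of "catch it at the call site and keep listening", here is a sketch using a plain std accept loop; Vector's actual source is futures-based, so the real change would look different, and this only shows the idea:

```rust
use std::io;
use std::net::TcpListener;

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:5514")?;
    for conn in listener.incoming() {
        match conn {
            Ok(stream) => match stream.peer_addr() {
                Ok(peer) => println!("accepted connection from {}", peer),
                // The peer can disconnect between accept() and peer_addr();
                // that is a per-connection event, not a reason to stop listening.
                Err(e) if e.kind() == io::ErrorKind::NotConnected => continue,
                Err(e) => eprintln!("failed to get peer address: {}", e),
            },
            // Errors from accept() itself should also not tear down the
            // listener; log them and keep accepting.
            Err(e) => eprintln!("failed to accept connection: {}", e),
        }
    }
    Ok(())
}
```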

Facing a similar error, but for a source of type=vector. Currently using the latest deb for version 0.10.0: version="0.10.0" git_version="v0.9.0-377-g0f0311a"

Aug 04 09:33:15 vector-aggregator01 vector[6632]: Aug 04 09:33:15.402 ERROR source{name=vector_aggregator_listener type=vector}: vector::sources::util::tcp: failed to accept socket error=Could not get peer address: Transport endpoint is not connected (os error 107)
Aug 04 09:33:15 vector-aggregator01 vector[6632]: Aug 04 09:33:15.402 ERROR source{name=vector_aggregator_listener type=vector}: vector::topology: An error occurred that vector couldn't handle.

Let me dive a little into the architecture I am using. I have 2 log aggregation nodes in a listening setup with the following configuration. The vector daemons on the other nodes send logs to these nodes for further ingestion and processing:

[sources.vector_aggregator_listener]
type = "vector"
address = "0.0.0.0:22222"

Both nodes are behind haproxy with the following configuration:

frontend vector_aggregators
        bind *:24284
        option tcp-smart-accept
        option splice-request
        option splice-response
        mode tcp
        timeout client 120s
        timeout server 120s
        default_backend vector_aggregators_backend

backend vector_aggregators_backend
        balance leastconn
        option tcp-smart-connect
        option splice-request
        option splice-response
        mode tcp
        timeout tunnel 30m
        fullconn 700
        server vector-aggregator02 <ip2>:22222        maxconn 1200 weight 1 check
        server vector-aggregator01 <ip1>:22222        maxconn 1200 weight 1 check

Overall the architecture looks like this:
(vector agent on each service node) -> (haproxy running on both vector aggregator nodes, meshed for HA, listening on port 24284) -> (vector aggregator listening on port 22222 on both nodes)

We use haproxy as it does better load balancing than DNS-based load balancing.

I had a couple of questions at this point:

  1. Is this error happening due to the same issue as in #2750? Looking at the git log for tag 0.10.0, it doesn't seem like that fix has been merged.
  2. Is it a good idea to use vector behind a load balancer like haproxy? We had a td-agent log aggregation cluster in a similar architecture, but I can understand if vector might have issues with haproxy for some reason. This architecture is mostly working, except when vector dies with the mentioned error.

Additional notes:

  1. The process exits with exit code 0 under systemd; I think it should exit with a non-zero error code, following Unix semantics. Again, up to the devs.
  2. Currently, I have added a systemd override file to restart vector when this happens:
# /etc/systemd/system/vector.service.d/01-override-restart.conf
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
RestartForceExitStatus=0
RestartSec=1

@joscha-alisch feel free to reopen this issue if it wasn't fixed.
