Jaeger: Agent requires long connection timeout on AWS ELB

Created on 6 Jan 2018  ·  9 Comments  ·  Source: jaegertracing/jaeger

@ledor473 and I independently ran into timeout issues when using an ELB in front of the Jaeger Collector within AWS. Concretely, the connection configuration looked like the following:

[Application Container --> Jaeger Agent] --> Amazon Internal ELB --> [Jaeger Collector]

For reference, anything in [] can be thought of as running together in a containerized environment, e.g. a Pod in Kubernetes or a Task in AWS ECS.

The behavior I observed was repeated errors in my Jaeger Agent logs like the following:

"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1}
"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"<aws-internal-elb-redacted>:14267"}
"peerlistmgr/peer_list_mgr.go:176","msg":"Connected to peer","host:port":"[::]:14267"}

If I changed the agent's collector configuration to point directly at the collector (no LB), the errors subsided completely.

At the suggestion of @ledor473, both of us raised the idle connection timeout of our AWS ELBs from the default 60 seconds to 3600 seconds, and the issues went away entirely.
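
For anyone who wants to script the same change, here is a minimal sketch using boto3's Classic Load Balancer client. It is not taken from our actual setup: the helper names `idle_timeout_request` and `set_idle_timeout` are made up for illustration, and any load balancer name you pass in is your own. The only real API call is `modify_load_balancer_attributes` with `ConnectionSettings.IdleTimeout`, and running it assumes boto3 is installed and AWS credentials are configured.

```python
"""Sketch: raise the idle timeout on a Classic ELB (assumes boto3 + AWS credentials)."""

# Classic ELB accepts idle timeouts from 1 to 4000 seconds; the default is 60.
MIN_IDLE_TIMEOUT = 1
MAX_IDLE_TIMEOUT = 4000


def idle_timeout_request(load_balancer_name, idle_timeout_seconds):
    """Build the keyword arguments for elb.modify_load_balancer_attributes."""
    if not MIN_IDLE_TIMEOUT <= idle_timeout_seconds <= MAX_IDLE_TIMEOUT:
        raise ValueError(
            f"idle timeout must be between {MIN_IDLE_TIMEOUT} and "
            f"{MAX_IDLE_TIMEOUT} seconds, got {idle_timeout_seconds}"
        )
    return {
        "LoadBalancerName": load_balancer_name,
        "LoadBalancerAttributes": {
            "ConnectionSettings": {"IdleTimeout": idle_timeout_seconds}
        },
    }


def set_idle_timeout(load_balancer_name, idle_timeout_seconds):
    """Apply the new idle timeout to a Classic ELB."""
    import boto3  # imported lazily so the request builder works without boto3

    elb = boto3.client("elb")  # "elb" is the Classic Load Balancer API
    return elb.modify_load_balancer_attributes(
        **idle_timeout_request(load_balancer_name, idle_timeout_seconds)
    )
```

Calling `set_idle_timeout("<your-elb-name>", 3600)` would apply the value mentioned above. The request builder is kept separate so the payload can be inspected without touching AWS.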

bug

All 9 comments

@benjigoldberg according to this, we cannot set the timeout value: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

Is the agent talking to the collector behind the ELB over TChannel?

@blackgold that documentation is for the Network Load Balancer (NLB). If you use the Classic Load Balancer (ELB), you can configure it.

@ledor473 I assumed that the agent was using TChannel to communicate with the collector and that an NLB was set up.
I didn't know that ELB works with the agent.

It does use TChannel.

Have you tested if the NLB reacts the same way the ELB does?

According to the documentation, they might react differently:

In the ELB documentation:

Note that TCP keep-alive probes do not prevent the load balancer from terminating the connection because they do not send data in the payload.

In the NLB documentation:

Elastic Load Balancing sets the idle timeout value to 350 seconds. You cannot modify this value. Your targets can use TCP keepalive packets to reset the idle timeout.
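
To make the difference between the two quotes concrete, here is a toy model. It only encodes what the two documentation excerpts above say and is not a measurement of real AWS behavior. The `LoadBalancer` class and `survives_idle_period` function are hypothetical names, and treating a connection as dropped exactly at the timeout boundary is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadBalancer:
    name: str
    idle_timeout_s: int
    keepalive_resets_timer: bool  # do TCP keepalive probes count as activity?


# Values taken from the documentation excerpts quoted above.
CLASSIC_ELB_DEFAULT = LoadBalancer("classic-elb", 60, keepalive_resets_timer=False)
NLB = LoadBalancer("nlb", 350, keepalive_resets_timer=True)


def survives_idle_period(lb, idle_s, keepalive_interval_s=None):
    """Return True if a connection carrying no payload for idle_s seconds stays open.

    keepalive_interval_s is the interval between TCP keepalive probes,
    or None if keepalives are disabled.
    """
    if idle_s < lb.idle_timeout_s:
        return True  # the payload gap is shorter than the idle timeout
    if lb.keepalive_resets_timer and keepalive_interval_s is not None:
        # Probes reset the timer only if they arrive before it expires.
        return keepalive_interval_s < lb.idle_timeout_s
    return False
```

Under this model, an agent connection that sits idle for 120 seconds with 30-second keepalives is dropped by a Classic ELB at its default timeout, while the same connection through an NLB stays open.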

Just a random ping: if you are using or considering Jaeger, please consider commenting on https://github.com/jaegertracing/jaeger/issues/207

@yurishkuro done!

FWIW, we have Jaeger agents communicating through an NLB and did not run into this issue.

@ledor473, @benjigoldberg is this still happening, or may I close this one?

Good to close
