Spring-cloud-sleuth: High volume of requests causes IllegalStateException in StreamSpanReporter

Created on 11 Oct 2016  路  6Comments  路  Source: spring-cloud/spring-cloud-sleuth

Hi,

I've been playing with Sleuth, Zipkin and Kafka a bit lately, and noticed some exceptions when using relatively high request volumes.

I have some demo code here: https://github.com/barrycommins/zipkindemo/tree/master
It's just a simple Spring Boot web app with one @RestController.
Traces are sent to the Zipkin server with Kafka as the transport

Under normal load, everything looks fine:
zipkin trace normal

However, under higher load (just simulated with JMeter), I quickly start seeing java.lang.IllegalStateException: Queue full

Here's the stacktrace:

 java.lang.IllegalStateException: Queue full
at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.8.0_60]
at org.springframework.cloud.sleuth.stream.StreamSpanReporter.report(StreamSpanReporter.java:86) ~[spring-cloud-sleuth-stream-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.cloud.sleuth.trace.DefaultTracer.close(DefaultTracer.java:123) ~[spring-cloud-sleuth-core-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.cloud.sleuth.instrument.web.TraceHandlerInterceptor.afterCompletion(TraceHandlerInterceptor.java:153) ~[spring-cloud-sleuth-core-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.web.servlet.HandlerExecutionChain.triggerAfterCompletion(HandlerExecutionChain.java:170) ~[spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1055) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:980) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:897) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:622) [tomcat-embed-core-8.5.5.jar:8.5.5]
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]`

The traces then don't seem to be closed normally:
zipkin trace error

I guess this could be related to https://github.com/spring-cloud/spring-cloud-sleuth/issues/378 ?

Is there any way to handle this, or is the solution to just reduce the sample rate?
It doesn't seem to be an issue when sending traces to Zipkin via HTTP.

Thanks

bug

Most helpful comment

Now it should be better. We're suppressing the exception and increasing the counter of dropped spans.

All 6 comments

Yeah sampling is for that. You can also provide a custom poller:

@Configuration
    public static class CustomPollerConfiguration {

        @Bean(name = StreamSpanReporter.POLLER)
        PollerMetadata customPoller() {
            PollerMetadata poller = new PollerMetadata();
            poller.setMaxMessagesPerPoll(500);
            poller.setTrigger(new PeriodicTrigger(5000L));
            return poller;
        }
    }

Another thing that's interesting is that we're actually breaking your production code? @adriancole I guess we should not break the production code when the queue gets filled up. Maybe it would be better to log an exception but not break miserably? WDYT?

Hi @marcingrzejszczak , just demo code for now :-)

Would it be possible to maybe just start discarding spans if the queue is full?
It would probably be preferable to lose traces rather than throwing unhandled exceptions .

A sampling rate of 0.1 (or even 0.01) may be fine for expected traffic, but an unexpected spike should probably not start throwing exceptions if the rest of the code can cope. As you say, a custom poller would probably solve that. I'll definitely give it a try.

I've tried similar tests using HTTP as the transport mechanism instead of Kafka and haven't seen any issues.

Would it be possible to maybe just start discarding spans if the queue is full?
It would probably be preferable to lose traces rather than throwing unhandled exceptions .

I fully agree with this one

I've tried similar tests using HTTP as the transport mechanism instead of Kafka and haven't seen any issues.

That's what bothers me. Don't yet know the answer but most likely we're faster on the HTTP side then we are on the Stream side. Also AFAIR @adriancole has done some tweaking with async reporter for HTTP (he's put a lot of work in that) so I guess his work makes that possible. We'll have to somehow sync it with the stream approach.

It's probably not helping that I'm using Kafka in a Docker container, with Spring Cloud Stream's default producer configuration.

Kafka usually needs quite a lot of tuning compared to HTTP, so I suppose it's not really comparing like with like.

Now it should be better. We're suppressing the exception and increasing the counter of dropped spans.

Wow, that was fast!

Thanks, @marcingrzejszczak

Was this page helpful?
0 / 5 - 0 ratings

Related issues

oburgosm picture oburgosm  路  4Comments

beltram picture beltram  路  6Comments

apolischuk picture apolischuk  路  7Comments

marcingrzejszczak picture marcingrzejszczak  路  6Comments

nickcodefresh picture nickcodefresh  路  7Comments