Spring-cloud-sleuth: High volume of requests causes IllegalStateException in StreamSpanReporter

Created on 11 Oct 2016 · 6Comments · Source: spring-cloud/spring-cloud-sleuth

Hi,

I've been playing with Sleuth, Zipkin and Kafka a bit lately, and noticed some exceptions when using relatively high request volumes.

I have some demo code here: https://github.com/barrycommins/zipkindemo/tree/master
It's just a simple Spring Boot web app with one @RestController.
Traces are sent to the Zipkin server with Kafka as the transport

Under normal load, everything looks fine:
zipkin trace normal

However, under higher load (just simulated with JMeter), I quickly start seeing java.lang.IllegalStateException: Queue full

Here's the stacktrace:

 java.lang.IllegalStateException: Queue full
at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.8.0_60]
at org.springframework.cloud.sleuth.stream.StreamSpanReporter.report(StreamSpanReporter.java:86) ~[spring-cloud-sleuth-stream-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.cloud.sleuth.trace.DefaultTracer.close(DefaultTracer.java:123) ~[spring-cloud-sleuth-core-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.cloud.sleuth.instrument.web.TraceHandlerInterceptor.afterCompletion(TraceHandlerInterceptor.java:153) ~[spring-cloud-sleuth-core-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.web.servlet.HandlerExecutionChain.triggerAfterCompletion(HandlerExecutionChain.java:170) ~[spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1055) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:980) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:897) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:622) [tomcat-embed-core-8.5.5.jar:8.5.5]
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]`

The traces then don't seem to be closed normally:
zipkin trace error

I guess this could be related to https://github.com/spring-cloud/spring-cloud-sleuth/issues/378 ?

Is there any way to handle this, or is the solution to just reduce the sample rate?
It doesn't seem to be an issue when sending traces to Zipkin via HTTP.

Thanks

bug

Source

barrycommins

Most helpful comment

Now it should be better. We're suppressing the exception and increasing the counter of dropped spans.

marcingrzejszczak on 11 Oct 2016

👍2

All 6 comments

Yeah sampling is for that. You can also provide a custom poller:

@Configuration
    public static class CustomPollerConfiguration {

        @Bean(name = StreamSpanReporter.POLLER)
        PollerMetadata customPoller() {
            PollerMetadata poller = new PollerMetadata();
            poller.setMaxMessagesPerPoll(500);
            poller.setTrigger(new PeriodicTrigger(5000L));
            return poller;
        }
    }

Another thing that's interesting is that we're actually breaking your production code? @adriancole I guess we should not break the production code when the queue gets filled up. Maybe it would be better to log an exception but not break miserably? WDYT?

marcingrzejszczak on 11 Oct 2016

Hi @marcingrzejszczak , just demo code for now :-)

Would it be possible to maybe just start discarding spans if the queue is full?
It would probably be preferable to lose traces rather than throwing unhandled exceptions .

A sampling rate of 0.1 (or even 0.01) may be fine for expected traffic, but an unexpected spike should probably not start throwing exceptions if the rest of the code can cope. As you say, a custom poller would probably solve that. I'll definitely give it a try.

I've tried similar tests using HTTP as the transport mechanism instead of Kafka and haven't seen any issues.

barrycommins on 11 Oct 2016

Would it be possible to maybe just start discarding spans if the queue is full?
It would probably be preferable to lose traces rather than throwing unhandled exceptions .

I fully agree with this one

I've tried similar tests using HTTP as the transport mechanism instead of Kafka and haven't seen any issues.

That's what bothers me. Don't yet know the answer but most likely we're faster on the HTTP side then we are on the Stream side. Also AFAIR @adriancole has done some tweaking with async reporter for HTTP (he's put a lot of work in that) so I guess his work makes that possible. We'll have to somehow sync it with the stream approach.

marcingrzejszczak on 11 Oct 2016

It's probably not helping that I'm using Kafka in a Docker container, with Spring Cloud Stream's default producer configuration.

Kafka usually needs quite a lot of tuning compared to HTTP, so I suppose it's not really comparing like with like.

barrycommins on 11 Oct 2016

Now it should be better. We're suppressing the exception and increasing the counter of dropped spans.

marcingrzejszczak on 11 Oct 2016

👍2

Wow, that was fast!

Thanks, @marcingrzejszczak

barrycommins on 11 Oct 2016

Was this page helpful?

0 / 5 - 0 ratings

Related issues

ClassCastException with webClient

oburgosm · 4Comments

Regression in 3.0.2: tracer context lost when used with spring-data-mongodb

beltram · 6Comments

Spring App Name is not resolved properly

apolischuk · 7Comments

Introduce Reactor instrumentation

marcingrzejszczak · 6Comments

add EnableAspectJAutoProxy(proxyTargetClass = true) be default

nickcodefresh · 7Comments