Hi,
I've been playing with Sleuth, Zipkin and Kafka a bit lately, and noticed some exceptions when using relatively high request volumes.
I have some demo code here: https://github.com/barrycommins/zipkindemo/tree/master
It's just a simple Spring Boot web app with one @RestController.
Traces are sent to the Zipkin server with Kafka as the transport
Under normal load, everything looks fine:

However, under higher load (just simulated with JMeter), I quickly start seeing java.lang.IllegalStateException: Queue full
Here's the stacktrace:
java.lang.IllegalStateException: Queue full
at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.8.0_60]
at org.springframework.cloud.sleuth.stream.StreamSpanReporter.report(StreamSpanReporter.java:86) ~[spring-cloud-sleuth-stream-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.cloud.sleuth.trace.DefaultTracer.close(DefaultTracer.java:123) ~[spring-cloud-sleuth-core-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.cloud.sleuth.instrument.web.TraceHandlerInterceptor.afterCompletion(TraceHandlerInterceptor.java:153) ~[spring-cloud-sleuth-core-1.0.9.RELEASE.jar:1.0.9.RELEASE]
at org.springframework.web.servlet.HandlerExecutionChain.triggerAfterCompletion(HandlerExecutionChain.java:170) ~[spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1055) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:980) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:897) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:622) [tomcat-embed-core-8.5.5.jar:8.5.5]
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846) [spring-webmvc-4.3.3.RELEASE.jar:4.3.3.RELEASE]`
The traces then don't seem to be closed normally:

I guess this could be related to https://github.com/spring-cloud/spring-cloud-sleuth/issues/378 ?
Is there any way to handle this, or is the solution to just reduce the sample rate?
It doesn't seem to be an issue when sending traces to Zipkin via HTTP.
Thanks
Yeah sampling is for that. You can also provide a custom poller:
@Configuration
public static class CustomPollerConfiguration {
@Bean(name = StreamSpanReporter.POLLER)
PollerMetadata customPoller() {
PollerMetadata poller = new PollerMetadata();
poller.setMaxMessagesPerPoll(500);
poller.setTrigger(new PeriodicTrigger(5000L));
return poller;
}
}
Another thing that's interesting is that we're actually breaking your production code? @adriancole I guess we should not break the production code when the queue gets filled up. Maybe it would be better to log an exception but not break miserably? WDYT?
Hi @marcingrzejszczak , just demo code for now :-)
Would it be possible to maybe just start discarding spans if the queue is full?
It would probably be preferable to lose traces rather than throwing unhandled exceptions .
A sampling rate of 0.1 (or even 0.01) may be fine for expected traffic, but an unexpected spike should probably not start throwing exceptions if the rest of the code can cope. As you say, a custom poller would probably solve that. I'll definitely give it a try.
I've tried similar tests using HTTP as the transport mechanism instead of Kafka and haven't seen any issues.
Would it be possible to maybe just start discarding spans if the queue is full?
It would probably be preferable to lose traces rather than throwing unhandled exceptions .
I fully agree with this one
I've tried similar tests using HTTP as the transport mechanism instead of Kafka and haven't seen any issues.
That's what bothers me. Don't yet know the answer but most likely we're faster on the HTTP side then we are on the Stream side. Also AFAIR @adriancole has done some tweaking with async reporter for HTTP (he's put a lot of work in that) so I guess his work makes that possible. We'll have to somehow sync it with the stream approach.
It's probably not helping that I'm using Kafka in a Docker container, with Spring Cloud Stream's default producer configuration.
Kafka usually needs quite a lot of tuning compared to HTTP, so I suppose it's not really comparing like with like.
Now it should be better. We're suppressing the exception and increasing the counter of dropped spans.
Wow, that was fast!
Thanks, @marcingrzejszczak
Most helpful comment
Now it should be better. We're suppressing the exception and increasing the counter of dropped spans.