Spring-cloud-sleuth: Zuul: Spans are removed from MDC "too early" when SendErrorFilter forwards to /error

Created on 12 Apr 2017  路  17Comments  路  Source: spring-cloud/spring-cloud-sleuth

Suppose a servlet filter logging incoming requests configured with a priority such that it appears just after Sleuth's TraceFilter. The filter would log a _pre_ and _post_ message just like Spring's CommonRequestLoggingFilter. Thanks to Sleuth integration, both lines will have the _spanid_ attached to them as shown in below sample log output:

2017-04-12 19:07:09.090  INFO [-,b1fe361d6c1d8c35,b1fe361d6c1d8c35,false] 73856 --- [nio-8080-exec-1] c.e.ZuulsleuthApplication$MyFilter       : BEFORE
2017-04-12 19:07:09.098  INFO [-,b1fe361d6c1d8c35,b1fe361d6c1d8c35,false] 73856 --- [nio-8080-exec-1] o.s.c.n.zuul.web.ZuulHandlerMapping      : Mapped URL path [/test/**] onto handler of type [class org.springframework.cloud.netflix.zuul.web.ZuulController]
...
2017-04-12 19:07:09.120  INFO [-,b1fe361d6c1d8c35,b1fe361d6c1d8c35,false] 73856 --- [nio-8080-exec-1] c.e.ZuulsleuthApplication$MyFilter       : AFTER

The SendErrorFilter is triggered whenever Zuul encounters a problem during the routing of the request (either via SimpleRouting or RibbonRouting). The filter forwards the request to the /error endpoint for handling. In this case it appears the span information is NOT present in the MDC when the execution flow returns to the logging ServletFilter - hence the _AFTER_ message doesn't have the spanid attached to it. Example:

2017-04-12 19:44:21.307  INFO [-,87ac57c7bbda596d,87ac57c7bbda596d,false] 74041 --- [nio-8080-exec-1] c.e.ZuulsleuthApplication$MyFilter       : BEFORE
2017-04-12 19:44:21.315  INFO [-,87ac57c7bbda596d,87ac57c7bbda596d,false] 74041 --- [nio-8080-exec-1] o.s.c.n.zuul.web.ZuulHandlerMapping      : Mapped URL path [/test/**] onto handler of type [class org.springframework.cloud.netflix.zuul.web.ZuulController]
2017-04-12 19:44:21.395  WARN [-,87ac57c7bbda596d,c80a4aef1716bc74,false] 74041 --- [nio-8080-exec-1] o.s.c.n.z.filters.post.SendErrorFilter   : Error during filtering

com.netflix.zuul.exception.ZuulException: Connect to localhost:1 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused
    at org.springframework.cloud.netflix.zuul.util.ZuulRuntimeException.<init>(ZuulRuntimeException.java:33) ~[spring-cloud-netflix-core-1.3.0.RC1.jar:1.3.0.RC1]
    at org.springframework.cloud.netflix.zuul.filters.route.SimpleHostRoutingFilter.run(SimpleHostRoutingFilter.java:200) ~[spring-cloud-netflix-core-1.3.0.RC1.jar:1.3.0.RC1]
...

2017-04-12 19:44:21.464  INFO [-,,87ac57c7bbda596d,false] 74041 --- [nio-8080-exec-1] c.e.ZuulsleuthApplication$MyFilter       : AFTER

Look at the last line: the traceid is still present, but the spanid is gone (or is it the opposite?).

Although we discovered that behaviour with Brixton.SR7, it can be reproduced with Dalston.RC1 as well.

bug

Most helpful comment

Hope you feel better now ;-)

All 17 comments

Attached is a sample application that can be used to reproduce the issue.
Start the com.example.ZuulsleuthApplication as usual then curl http://localhost:8080/test.

zuulsleuth.zip

Thanks for the detailed description and the sample and sorry for the inconvenience. Could you please upload the sample as a Github project? It will be easier for me to operate on it

Sample is available at https://github.com/brenuart/spring-cloud-sleuth-gh563, have fun ;-)

Haha thanks :D

Could you confirm the issue?

I'll take a look at that today

Thx - just wanted to know if its a bug, feature or mistake from us...

No problem - I was on a sick leave that's why some issues left without any response. Sorry for the delay

Hope you feel better now ;-)

Replicated and most likely fixed. I'll run some tests and will push this to a branch so that you can confirm that it's working fine. I've ran it on your sample and it seems to be looking good.

Can you check out this branch issues_#563_improving_zuul_error and see if everything is ok?

@marcingrzejszczak Sorry we could not test your fix due to lack of time.

The problem was caused by Zuul _forwarding_ the request directly to the error endpoint instead of letting the exception bubble up and relying on Tomcat to forward it to its default error page.
Our workaround was to slightly change the SendErrorFilter to remove that behaviour and instead set a special "error" attribute on the request. That attribute is later checked by a custom filter registered before Sleuth's TraceFilter and forward the request to /error if needed.
No forwarding anymore, and no problem...

However, I wonder if the behaviour described in this issue will affect any forwarded request.
What about standard SpringBoot application with the error handling configured to forward to /error instead of using @ExceptionHandler ?

Oh, btw - there are a bit more consequences to the issue than the simple "span/trace not present anymore on the MDC". We noticed the SpanContextHolder still hold a _current span_ after an error is handled/received by Zuul. Subsequent requests handled by the same (Tomcat) thread will therefore attempt to continue it...

@brenuart so you're saying that after introducing this change it's wrong due to

We noticed the SpanContextHolder still hold a current span after an error is handled/received by Zuul. Subsequent requests handled by the same (Tomcat) thread will therefore attempt to continue it...

As for the /error - in the TraceFilter we're already handling that case so that should work but I guess I'd need a sample to check that with Zuul.

Not exactly - sorry for my english :(

I was just saying that my original issue description was only partial. The symptoms of the problem were:

  • missing traceId on the MDC and therefore on the logs
  • but _also_ that the traceId were likely to be "reused" by subsequent requests handled on the same thread if they had no traceId headers

I though it might be useful to add this extra information for the record in case someone runs into the same problem.

About your fix, as I said, we haven't had the chance yet to test it. We are busy trying to migrate to the latest versions...

Ah ok. So for now I'm assuming that it's working fine. If you have time and it's not working please feel free to comment and we'll reopen this ok?

Ok.

Was this page helpful?
0 / 5 - 0 ratings