We have updated several applications to Spring Boot 2.1.8 and metric publishing using the StatsD meter registry has stopped. The actuator/metrics endpoint shows metrics being collected as expected. We publish to localhost on port 8125 at a 1m interval (all defaults).
The actuator/configprops endpoint shows the metrics publication is enabled for statsd.
management.metrics.export.statsd-org.springframework.boot.actuate.autoconfigure.metrics.export.statsd.StatsdProperties: {
prefix: "management.metrics.export.statsd",
properties: {
flavor: "DATADOG",
publishUnchangedMeters: true,
port: 8125,
pollingFrequency: {
units: [
"SECONDS",
"NANOS"
]
},
host: "localhost",
maxPacketLength: 1400,
enabled: true
}
}
| Spring Boot Version | Micrometer Version | Result |
|:------------|:----------------|:-------|
| 2.1.8 | 1.1.6 | ❌ metrics not published to statsd |
| 2.1.8 | 1.1.5 | ✅ working |
| 2.1.7 | 1.1.4 | ✅ working |
| 2.1.7 | 1.1.6 | ❌ metrics not published to statsd |
When running Micrometer 1.1.5 and 1.1.4, the Spring Boot java process shows an established connection on an ephemeral port sending to port 8125.
netstat -uv | grep 8125
udp 0 0 localhost:53613 localhost:8125 ESTABLISHED
```bash
sudo netstat -lanp | grep 53613
udp 0 0 ::ffff:127.0.0.1:53613 ::ffff:127.0.0.1:8125 ESTABLISHED 3295/java
UDP Traffic being received with a capture of the UDP traffic using tcpdump.
```bash
sudo tcpdump -i lo -n udp port 8125
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
19:17:34.053140 IP 127.0.0.1.53613 > 127.0.0.1.8125: UDP, length 1260
19:17:34.134113 IP 127.0.0.1.53613 > 127.0.0.1.8125: UDP, length 1261
19:17:34.199557 IP 127.0.0.1.53613 > 127.0.0.1.8125: UDP, length 1261
Utilizing the same command when running the same application with 1.1.6, the java process does not show any ephemeral port established and bound to 8125.
After changing the logging level to DEBUG for io.micrometer.shaded.reactor, I could see many onNextDropped debug messages output for each meter with its associated line data:
2019-09-17 21:51:12,895 DEBUG 3319 (http-nio-5000-exec-2) [Operators] onNextDropped: method.timed:2.426269|ms|#class:LabelJdbcStore,exception:none,method:findLabeledPids
Since Operators uses multicast, that led me to look at ifconfig and finding that multicast is not turned on for the loopback interface for our AWS EC2 instances, but it is for eth0.
ifconfig
eth0 Link encap:Ethernet HWaddr 0A:F3:B8:FE:4E:2E
inet addr:10.36.73.178 Bcast:10.36.79.255 Mask:255.255.240.0
inet6 addr: fe80::8f3:b8ff:fefe:4e2e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:5908501 errors:0 dropped:0 overruns:0 frame:0
TX packets:3941816 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6430044133 (5.9 GiB) TX bytes:934258311 (890.9 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:2918885 errors:0 dropped:0 overruns:0 frame:0
TX packets:2918885 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5714060914 (5.3 GiB) TX bytes:5714060914 (5.3 GiB)
I turned on multicast for the loopback and restarted the Spring Boot app and metrics are now being published.
sudo ifconfig lo multicast
ifconfig lo
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MULTICAST MTU:65536 Metric:1
RX packets:3058498 errors:0 dropped:0 overruns:0 frame:0
TX packets:3058498 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5988434965 (5.5 GiB) TX bytes:5988434965 (5.5 GiB)
When the loopback interface is configured without multicast, metrics are published when using Micrometer 1.1.5. The eth0 interface has multicast turn on so I'm wondering if that interface was being selected in 1.1.5 and now it isn't in 1.1.6. I haven't been able to narrow down what change in 1.1.6 caused this to stop working.
Not sure if it's the same issue, but I have noticed some of our applications using Micrometer with StatsD recently have stopped publishing metrics at seemingly random moment, but all the same time. They recently seem to have been upgraded from Micrometer 1.1.5 to 1.1.6, as managed by respectively Spring Boot 2.1.7.RELEASE and 2.1.8.RELEASE.
Will try downgrading to Micrometer 1.1.5 via Spring Boot's Maven property to see if this helps.
<properties>
<micrometer.version>1.1.5</micrometer.version><!-- Temporary workaround for StatsD bug: https://github.com/micrometer-metrics/micrometer/issues/1591 -->
</properties>
If I were to guess, the upgrade from reactor-netty 0.8.1 to 0.8.11 (see #1574 and #1561) is a likely candidate for the difference in behavior here. In previous patches, we never upgraded the shaded version of reactor-netty we were using. I will have to look into this more to confirm the root cause and what to do about this, though.
Thank you for the thorough investigation @edwardsre and write-up of the issue.
As a workaround, I've downgraded reactor-netty in #1613 which restores the previous behavior as far as my testing can tell.
Request to affected users
Would those affected by this issue please test with Micrometer 1.1.7-SNAPSHOT or 1.1.7 when it is released to confirm this fixes things for now?
I'm leaving this issue open as we still want to identify the root cause and figure out where a fix needs to happen so we are not forever stuck on older versions.
In the cases I witnesses and heard about, the application has been running for a couple of days already. Does anyone have a quicker way to reproduce/test?
@shakuzen I heard from colleagues that they also encountered this with Micrometer 1.2.0. Is that possible and does reactor-netty maybe also need a downgrade in the 1.2.x branch?
@shakuzen I see Micrometer 1.1.7, 1.2.2 and 1.3.0 have been released. You say reactor-netty was downgraded in 1.1.7, which seems to be have been released now, and the 1.3.0 release notes mention all changes in 1.1.7 and 1.2.2 are also incorporated into 1.3.0. However, the 1.3.0 release notes also mention this issue as a known issue. So, is the issue fixed or not in 1.1.7, 1.2.2 and 1.3.0?
@shakuzen I have deployed one of our affected applications with the 1.1.7 release and it seems to have remedied the reported problem seen with 1.1.6. For us, it did not take days for the break to occur as suggested by @breun above, but was immediate upon deploy. Either way, I will watch it for a while to make sure it doesn't stop working.
Thanks for leaving this open and continuing to look into the root cause in reactor-netty and netty.
However, the 1.3.0 release notes also mention this issue as a known issue. So, is the issue fixed or not in 1.1.7, 1.2.2 and 1.3.0?
The workaround of downgrading the reactor-netty and netty dependencies is in the 1.1.7 and 1.2.2 release. 1.3.0 upgrades to the recent reactor-netty 0.9.0 minor release, and there were some compatibility issues with downgrading that, so the issue is still present in 1.3 for now. Once the root cause of this issue is identified and fixed, we'll get out a patch release with the fix.
@edwardsre Thank you for trying out the release and confirming things are back to working as expected so far.
Does not work for me with Micrometer 1.1.7 & Spring Boot 2.1.9.
Works with Micrometer 1.1.5 & Spring Boot 2.1.9.
Spring Boot 2.2.0 (comes with Micrometer 1.3.0) also does not work.
Could someone clarify what's the recommendation for Spring Boot users?
Stick with the latest Spring Boot and use Micrometer 1.1.5?
Have you tried using Spring Boot 2.2.0.RELEASE and setting <micrometer.version>1.1.7</micrometer.version> in your Maven pom.xml properties (or the equivalent in Gradle if you use that) to see if Spring Boot 2.2.0.RELEASE is compatible with Micrometer 1.1.7?
If not, I guess you need to stick with Spring Boot 2.1.x until this issue is fixed in Micrometer 1.3.x.
I have investigated this more and the issue seems to be that the behavior of the reactor-netty UdpClient has changed:
before reactor-netty 0.8.7 / netty 4.1.36
UdpClient continues to send statsd lines even if the statsd host isn't accepting the connection.
since reactor-netty 0.8.7 / netty 4.1.36
UdpClient stops sending statsd lines if it fails to send twice.
Note that there should be no problem as long as the statsd daemon is started and accepting lines before the application starts sending statsd lines, and the application is able to continually send lines to the daemon. If, however, the statsd daemon is not accepting the lines sent by the application, the application will stop sending lines with the new versions of netty and reactor-netty.
I don't know if the change in behavior is intentional or not. Maybe it was caused by netty, and not something directly in reactor-netty. Perhaps @violetagg can speak to that.
In Micrometer, we could obviate the issue by implementing recovery logic for failing to send statsd lines via UDP as was already requested for TCP in #1212. Otherwise we would need a way to get the previous behavior from the underlying reactor-netty UdpClient.
@shakuzen Can you test 0.8.13 Reactor Netty + 4.1.43 Netty because we have changes there?
Can you test 0.8.13 Reactor Netty + 4.1.43 Netty because we have changes there?
@violetagg I have tried with 0.8.13 Reactor Netty + 4.1.43 Netty and it has the same behavior as mentioned above for "since reactor-netty 0.8.7 / netty 4.1.36".
I have updated the title of the issue to reflect my understanding after my testing showed consistent behavior regardless of multicast. I will open a separate issue for Micrometer 1.3.1 to temporarily downgrade our shaded versions of reactor-netty / netty to 0.8.6 / 4.1.34 which have the previous behavior. I'll leave this issue open to track the long-term solution.
Does it mean this issue has been fixed in version 1.3.1 by downgrading the netty packages?
Does it mean this issue has been fixed in version 1.3.1 by downgrading the netty packages?
That is my expectation. It is more of a workaround than the long-term solution. I'm leaving this issue open until we have a long term solution that allows us to upgrade to the latest versions of dependencies.
Let us know if this isn't fixed for you with 1.3.1. In my testing, it appeared to be fixed.