Elasticsearch version: 5.0.0 (253032b)
Plugins installed: none
JVM version: 1.8.0_102
OS version: Linux 4.4.20-moby (vanilla 5.0 Docker container)
Description of the problem including expected versus actual behavior:
Calling _nodes/stats?all
should return all stats populated, as it did in previous versions and is still documented to work as in the Node Stats documentation:
The
all
flag can be set to return all the stats.
Instead, it returns a 400:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
}
],
"type" : "illegal_argument_exception",
"reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
},
"status" : 400
}
Steps to reproduce:
/_nodes/stats?all
, e.g. http://localhost:9200/_nodes/stats?allCan we restore this functionality please? Given the lack of a documentation update, I'm hoping this was an unintentional and completely unnecessary breaking change.
At a glance, it looks related to #21266 (see https://github.com/elastic/elasticsearch/blob/5.0/core/src/main/java/org/elasticsearch/rest/action/admin/cluster/RestNodesStatsAction.java#L64)
By default, node stats returns all stats. You can explicitly request all stats by hitting /_nodes/stats/_all
or /_nodes/stats?metric=_all
. Requesting /_nodes/stats?all
never did anything, it was just silently ignored in the past. I'll clarify the docs.
Requesting /_nodes/stats?all never did anything, it was just silently ignored in the past
Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and all
used to have a purpose. But that is ancient history
@jasontedor Was this the case in all previous versions? Monitoring systems that hit the APIs need to span many versions, and every breaking change like this (like changing hostname
to host
) create real unnecessary pain points.
This breaks our monitoring system. And anyone else's spanning many versions. Why break this? Why was it necessary? Just ignore the param if the behavior would be the same.
Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and
all
used to have a purpose. But that is ancient history
It's like this in 1.x and 2.x as well, ?all
did nothing, it was just silently ignored.
Was this the case in all previous versions?
It was the case in 1.x and 2.x for sure, I just checked the code.
Why break this? Why was it necessary?
It didn't do anything, it was silently ignored.
Just ignore the param if the behavior would be the same.
This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.
I pushed the following commits:
This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.
It _was_ doing something in old versions. If you think no one is still using 0.90, that's just way off base. Unfortunately, that's reality. And they're using some of our stuff to monitor it. So now, for no good reason, instead of maintaining minimal code in Elastic to ignore it, every monitoring system has to maintain a code fork and version awareness to adapt to this. That's not an efficient use of time for anyone.
This is happening with every release, and none of them were necessary. Please, quit breaking things for very minor cleanups. It's just shifting a lot of work to n
consumers.
So we're just supposed to keep all of our mistakes forever? Sorry @NickCraver but no. Strict parsing of query string parameters is a really important step forwards, and we're not going to keep cruft forever. We try hard not to break things. Sometimes we do it unintentionally without realising that something will be an issue, for which I apologise, but it is untenable to keep everything the same forever. It would make our code so creaky and complicated we wouldn't be able to touch it without fear of breaking things.
It was doing something in old versions. If you think no one is still using 0.90, that's just way off base.
That's right, very old versions, and versions that we do not support anymore.
Please, quit breaking things for very minor cleanups.
This change was made almost three years ago in bb275166f1a1f20f5330691ab854d6dc61c51366. The related issue is #4057.
The addition of strict REST parsing was a major, nor minor, step forward for Elasticsearch. The related PR is #20722.
So we're just supposed to keep all of our mistakes forever?
No. But not acknowledging the APIs that _need_ to span versions are more special and should have a much higher bar here shows a large disconnect with what consumes these APIs. I never said _everything_ should work forever. I am speaking about a very narrow scope of APIs here.
Instead of you maintaining support, now _everyone else_ has to deal with it. And all monitoring systems have to have a code fork here to handle it. Is that really the best use of human time?
Also note: this isn't ancient history. It was documented to work an hour ago.
Note there is no "strict REST parsing" for the _nodes
endpoint.
curl '0:9200/_nodes/stats/foobar'
gives a (unexpected) response on an undefined metric "foobar" and is not rejected as invalid request.
@jprante Strict REST parsing (#20722) was about rejecting unrecognized URL parameters, what you're referring to is a separate issue and should be addressed.
@jprante Thanks for reporting this, I have a fix for this and will make a formal PR shortly.
@jasontedor You shouldn't make _another_ breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?
Also note: this isn't ancient history. It was documented to work an hour ago.
It was a documentation bug and now it's fixed.
You shouldn't make _another_ breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?
It's consumers that are using the stats endpoints with invalid metrics that are broken but are going undetected because we are lenient here. The change would be a bug fix, and I'm definitely not holding it until a major version. It's tempting to even include it in a bug fix release, but it's definitely going into 5.1.0.
Thanks for reporting @jprante, I opened #21417.
@jasontedor Honest question, do you just not care about breaking users? Does Elastic in general not care a bit about semantic versioning? I'm honestly trying to gauge what the intent is here. From a consumer point of view this is both hostile and toxic. As a maintainer of a project people consume, this is just all around bad.
There's nothing harmful about these endpoints right now. They are experiencing the same behavior from the day they deployed on 5.0. If they upgrade to 5.1.0 with a change like this, you're breaking them. Why? What good does that do? There's good reason breaking changes are reserved for major versions. Most of the software world understands this.
the fact we can now strictly parse parameters is a huge improvement to users, since it allows us to be properly consistent and predictable with our APIs. Something, as anything with writing software, I personally wish I should have done with the get go.
On the other hand, I do appreciate the implications when it comes to backward comp. I would like to hear how you use it, to make sure it is not broken already, and actually, this fix causes things to be fixed? Our goal it _not_ to break backward comp. but I want to understand and appreciate the balance between wrong invocation of API due to mis-design on end that gets fixed, vs. really breaking correct existing behavior? @NickCraver can you share more info?
hey @NickCraver , looking into it, we did make a braking change, and made ?all
a breaking change from 5.0._0_ forward, we just missed the documentation aspect, which we fixed. Strict URL parsing is important for resiliency in our codebase, so regardless, you would need to change 5.x code to support it.
I view strict URL parsing important, and regardless of the docs, which we missed, and I apologize for that, the _code_ still rejects the ?all
flag, from 5.0.0 onwards, which is good. And any monitoring tool working against 5.x versions need to adjust to it.
p.s. I am on a EU TZ, so can't respond soonish moving forward, please don't view me being in-responsives as a non comment.
@kimchy So context here: there is no functionality fix here, the documentation was retroactively changed to "fix" a breaking change. That breaking change was in a major release (good thing), though IMO unnecessary (distinct issue). It should be added to breaking change docs regardless. There's now _another_ change proposed in #21417 that's breaking _not_ in a major version change (bad). That's an issue in itself, and just bad behavior for any project.
The documentation fix is appreciated, but what are we fixing here? The strict parameter parsing is good, but would adding this as a recognized parameter and maintaining behavior be a bad thing? Then when a monitoring system hits these routes, it's not version switched per version. When things like this break, I'm not only fixing our monitoring system (both Bosun and Opserver), everyone else is also doing this. And that's bad, because this break wasn't really necessary.
No one's asking that we revert strict parameter parsing, the ask if much smaller in scope: recognize this parameter. It was allowed and documented to be allowed past the 5.x release (the live docs still have it).
More globally, the monitoring APIs in general are the thing most likely to need many-version support at once. Any changes that aren't _really_ necessary here are painful, multiplied by the number of systems hitting them. As a user, what's the benefit here? How many lines of code would it take in elasticsearch to make this work? The current version would need to _just ignore it_. Is that a big ask? What's happening right now is we've gone from a route that worked fine to throwing a 400, for...what? How much code did we save? Was it worth the pain caused to users?
This is broken. There's no fix for it. Users are on 5.0.0 and I have to support them. The ship has sailed and either monitoring systems _specifically_ don't support this exact version (if the behavior did go back), or we already have to have this fork in our code. For probably a decade.
But what's the bigger issue here? Breaking changes are proposed for a 5.1.0 or (even worse) a 5.0.x in #21417 is very bad. These should absolutely not be allowed. If I have no idea if my stuff will break in the most minor of releases, then I won't upgrade. That's bad for everyone. There are really good reasons that semantic versioning exists and this is a primary example. Please, don't let this happen. Elastic has to take breaking changes seriously if they want people to be on a version anywhere near current. If users can't trust an upgrade, then they won't. Things like this (if deployed) destroy such trust, instead of building it.
I honestly have no idea if Elastic (in general) strives to maintain semantic versioning, but even if that's not the case, can we at least get on the same page with breaking changes? As a developer and a sysadmin, I'm telling you: this is _extremely_ important.
I also backported the docs fixes to 2.4 via a025c9c2eb1564033eb148a8d157ae70849b108c and 1.7 via 6d95de2cdaca419322a8a60a200b8a7dbbd12bf8.
@NickCraver I think we crossed responses, based on my understanding, 5.0.0 is not a unique version from a code perspective, we just missed docs. If it was a unique version from a code behavior perspective, I 1000% agree with you. But it is not. From 5.0.0 onwards, ?all will get rejected.
@kimchy Are you only seeing one of the two issues here? #21417 is not about ?all
. It's _another_ breaking change, pitched for both a minor or even a "bugfix" release.
The ?all
is unsolvable in itself (as mentioned above) because 5.0.0 shipped at all. I still disagree that, given the nature of this API set, it should have been broken at all. That's more of a discussion on breaking monitoring APIs in general - which should happen. Which APIs should be extremely long-term stable? (Without _very_ compelling reasons to break them)
The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.
Hi @NickCraver
There is a tension here between forcing existing users to adjust to breaking changes and saving new users from hours of debugging when the parameters that they try just don't work. We have had so many issues opened because our APIs silently ignored unknown parameters. We have largely fixed this - we now tell you when you're doing something wrong, and we provide suggestions for which parameters you probably wanted to use. This is a huge step forward in usability.
The all
flag was removed in 1.0, along with the clear
and various metric flags. The bug was that we didn't remove all the documentation about the all
flag, but it hasn't worked since 1.0. I understand your point about how ignoring the all
flag silently in 5.0 wouldn't make a difference, but why is the all
flag different from the other flags? Maybe you are only using the all
flag but there may be others using the other flags, which also haven't worked since 1.0. If we silently ignore those too, we leave users guessing about why things aren't working the way they assume instead of telling them immediately what the problem is.
Regarding the strict checking of metrics in #21417. We have formally embraced strict REST parsing in 5.0 - shouting loud and shouting early saves users so much time and heartache, the benefits are clear. If we find APIs that still accept unknown parameters, it is considered a bug which should be fixed, not a breaking change.
If we continue silently ignoring unknown metrics, we're punishing all of those users who don't understand why their requests aren't working correctly. We can add backwards compatibility and deprecation logging for the percolator
metric, which is no longer available, but how does it help users to ignore unknown metrics? Most users are probably unaware that they have a problem and would be grateful for being told what the problem is. While we could delay this change for 6.0, it would mean another year where users are left in the dark.
The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.
@ejsmith We're trying hard really hard to balance these two camps: not hurting existing users and making things better for new users. This is not an easy choice. We don't make breaking changes lightly and when we do, it is with the intention of moving to more stable APIs. The breaking changes between 1.x and 2.0 were hard for everybody, 2.x to 5.0 should be a much easier experience. We don't always get it right but please be aware that a lot of soul searching goes into these decisions.
it is considered a bug which should be fixed, not a breaking change
These are orthogonal issues, and not at all mutually exclusive. Any change can be both, as is the example here. That doesn't change the fact that breaking changes should be reserved for major versions. Yet repeatedly, with this project, they are not. This contributes greatly to people not upgrading. It's one of the reasons we don't upgrade often here.
This is not an easy choice. We don't make breaking changes lightly
I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it. I really wish I had the tab open with a screenshot (maybe someone else still does). If you want to carefully consider changes and what goes into things that's fine, but saying that happened here is bull. The fact that comment was edited shows it's recognized.
Elasticsearch is your project. Do what you want with it. But don't assume this has zero impact. If I have to weigh heavily testing every release for breaks because Elastic doesn't care about semantic versioning and is therefore shifting the burden to me (and every other user), then it's worth it one again to spend equal time evaluating other options all together. Would we move to Solr here? I have no idea, but it's absolutely worth investigating again, because the time factor went from low and known to completely unknown and significant. When we upgrade Stack Overflow to a new bugfix release, will it break production? Because you're "helping" me with these breaking changes? Well now we have no confidence in that.
The point of versioning and semantic versioning is that, as a user, I know what I am getting. If it's a x.0.0 upgrade, I know there will be breaks. If there are breaks hiding in build or minor releases, then I a) have to test far more heavily for every change, and b) now expect surprises when we deploy production. That's unhealthy, at best. Semantic versioning exists for extremely good reasons. I'm really not seeing that anyone at Elastic gets that in this issue. Instead, the comments and actions here indicate the exact opposite.
I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it.
I agree that was a poorly considered comment which doesn't reflect the thought process here. Hence the edit.
And your comments are appreciated and heard. They have, for instance, made me reconsider my decision to backport a change to the CAT api to 5.1 - it's now been reverted and will be in 6.0 only.
I get something like that.
Most helpful comment
@jasontedor Honest question, do you just not care about breaking users? Does Elastic in general not care a bit about semantic versioning? I'm honestly trying to gauge what the intent is here. From a consumer point of view this is both hostile and toxic. As a maintainer of a project people consume, this is just all around bad.
There's nothing harmful about these endpoints right now. They are experiencing the same behavior from the day they deployed on 5.0. If they upgrade to 5.1.0 with a change like this, you're breaking them. Why? What good does that do? There's good reason breaking changes are reserved for major versions. Most of the software world understands this.