Elasticsearch: Elasticsearch 5.0: _nodes/stats?all throws a 400

Created on 8 Nov 2016 · 30Comments · Source: elastic/elasticsearch

Elasticsearch version: 5.0.0 (253032b)
Plugins installed: none
JVM version: 1.8.0_102
OS version: Linux 4.4.20-moby (vanilla 5.0 Docker container)

Description of the problem including expected versus actual behavior:
Calling _nodes/stats?all should return all stats populated, as it did in previous versions and is still documented to work as in the Node Stats documentation:

The all flag can be set to return all the stats.

Instead, it returns a 400:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
  },
  "status" : 400
}

Steps to reproduce:

Navigate to /_nodes/stats?all, e.g. http://localhost:9200/_nodes/stats?all
Get a 400
Go to nearest corner
Assume fetal position
Cry a little

Can we restore this functionality please? Given the lack of a documentation update, I'm hoping this was an unintentional and completely unnecessary breaking change.

:CorFeatureStats >docs

Source

NickCraver

Most helpful comment

@jasontedor Honest question, do you just not care about breaking users? Does Elastic in general not care a bit about semantic versioning? I'm honestly trying to gauge what the intent is here. From a consumer point of view this is both hostile and toxic. As a maintainer of a project people consume, this is just all around bad.

There's nothing harmful about these endpoints right now. They are experiencing the same behavior from the day they deployed on 5.0. If they upgrade to 5.1.0 with a change like this, you're breaking them. Why? What good does that do? There's good reason breaking changes are reserved for major versions. Most of the software world understands this.

NickCraver on 8 Nov 2016

👍17 ❤1

All 30 comments

At a glance, it looks related to #21266 (see https://github.com/elastic/elasticsearch/blob/5.0/core/src/main/java/org/elasticsearch/rest/action/admin/cluster/RestNodesStatsAction.java#L64)

nlowe on 8 Nov 2016

By default, node stats returns all stats. You can explicitly request all stats by hitting /_nodes/stats/_all or /_nodes/stats?metric=_all. Requesting /_nodes/stats?all never did anything, it was just silently ignored in the past. I'll clarify the docs.

jasontedor on 8 Nov 2016

Requesting /_nodes/stats?all never did anything, it was just silently ignored in the past

Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and all used to have a purpose. But that is ancient history

clintongormley on 8 Nov 2016

@jasontedor Was this the case in all previous versions? Monitoring systems that hit the APIs need to span many versions, and every breaking change like this (like changing hostname to host) create real unnecessary pain points.

This breaks our monitoring system. And anyone else's spanning many versions. Why break this? Why was it necessary? Just ignore the param if the behavior would be the same.

NickCraver on 8 Nov 2016

Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and all used to have a purpose. But that is ancient history

It's like this in 1.x and 2.x as well, ?all did nothing, it was just silently ignored.

jasontedor on 8 Nov 2016

Was this the case in all previous versions?

It was the case in 1.x and 2.x for sure, I just checked the code.

Why break this? Why was it necessary?

It didn't do anything, it was silently ignored.

Just ignore the param if the behavior would be the same.

This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.

jasontedor on 8 Nov 2016

I pushed the following commits:

master: aec09a76d64a11f1eeeb6c7a637a775abbe68dda
5.x: f10e26a832f9c1bd67254b8eec23a173165f83f1
5.0: b3df73f172ca688496db6407552bada08b252f11

jasontedor on 8 Nov 2016

This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.

It _was_ doing something in old versions. If you think no one is still using 0.90, that's just way off base. Unfortunately, that's reality. And they're using some of our stuff to monitor it. So now, for no good reason, instead of maintaining minimal code in Elastic to ignore it, every monitoring system has to maintain a code fork and version awareness to adapt to this. That's not an efficient use of time for anyone.

This is happening with every release, and none of them were necessary. Please, quit breaking things for very minor cleanups. It's just shifting a lot of work to n consumers.

NickCraver on 8 Nov 2016

So we're just supposed to keep all of our mistakes forever? Sorry @NickCraver but no. Strict parsing of query string parameters is a really important step forwards, and we're not going to keep cruft forever. We try hard not to break things. Sometimes we do it unintentionally without realising that something will be an issue, for which I apologise, but it is untenable to keep everything the same forever. It would make our code so creaky and complicated we wouldn't be able to touch it without fear of breaking things.

clintongormley on 8 Nov 2016

👎1

It was doing something in old versions. If you think no one is still using 0.90, that's just way off base.

That's right, very old versions, and versions that we do not support anymore.

Please, quit breaking things for very minor cleanups.

This change was made almost three years ago in bb275166f1a1f20f5330691ab854d6dc61c51366. The related issue is #4057.

The addition of strict REST parsing was a major, nor minor, step forward for Elasticsearch. The related PR is #20722.

jasontedor on 8 Nov 2016

So we're just supposed to keep all of our mistakes forever?

No. But not acknowledging the APIs that _need_ to span versions are more special and should have a much higher bar here shows a large disconnect with what consumes these APIs. I never said _everything_ should work forever. I am speaking about a very narrow scope of APIs here.

Instead of you maintaining support, now _everyone else_ has to deal with it. And all monitoring systems have to have a code fork here to handle it. Is that really the best use of human time?

Also note: this isn't ancient history. It was documented to work an hour ago.

NickCraver on 8 Nov 2016

Note there is no "strict REST parsing" for the _nodes endpoint.

 curl '0:9200/_nodes/stats/foobar'

gives a (unexpected) response on an undefined metric "foobar" and is not rejected as invalid request.

jprante on 8 Nov 2016

@jprante Strict REST parsing (#20722) was about rejecting unrecognized URL parameters, what you're referring to is a separate issue and should be addressed.

jasontedor on 8 Nov 2016

@jprante Thanks for reporting this, I have a fix for this and will make a formal PR shortly.

jasontedor on 8 Nov 2016

@jasontedor You shouldn't make _another_ breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?

NickCraver on 8 Nov 2016

Also note: this isn't ancient history. It was documented to work an hour ago.

It was a documentation bug and now it's fixed.

jasontedor on 8 Nov 2016

You shouldn't make _another_ breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?

It's consumers that are using the stats endpoints with invalid metrics that are broken but are going undetected because we are lenient here. The change would be a bug fix, and I'm definitely not holding it until a major version. It's tempting to even include it in a bug fix release, but it's definitely going into 5.1.0.

jasontedor on 8 Nov 2016

Thanks for reporting @jprante, I opened #21417.

jasontedor on 8 Nov 2016

NickCraver on 8 Nov 2016

👍17 ❤1

the fact we can now strictly parse parameters is a huge improvement to users, since it allows us to be properly consistent and predictable with our APIs. Something, as anything with writing software, I personally wish I should have done with the get go.

On the other hand, I do appreciate the implications when it comes to backward comp. I would like to hear how you use it, to make sure it is not broken already, and actually, this fix causes things to be fixed? Our goal it _not_ to break backward comp. but I want to understand and appreciate the balance between wrong invocation of API due to mis-design on end that gets fixed, vs. really breaking correct existing behavior? @NickCraver can you share more info?

kimchy on 9 Nov 2016

hey @NickCraver , looking into it, we did make a braking change, and made ?all a breaking change from 5.0._0_ forward, we just missed the documentation aspect, which we fixed. Strict URL parsing is important for resiliency in our codebase, so regardless, you would need to change 5.x code to support it.

I view strict URL parsing important, and regardless of the docs, which we missed, and I apologize for that, the _code_ still rejects the ?all flag, from 5.0.0 onwards, which is good. And any monitoring tool working against 5.x versions need to adjust to it.

p.s. I am on a EU TZ, so can't respond soonish moving forward, please don't view me being in-responsives as a non comment.

kimchy on 9 Nov 2016

@kimchy So context here: there is no functionality fix here, the documentation was retroactively changed to "fix" a breaking change. That breaking change was in a major release (good thing), though IMO unnecessary (distinct issue). It should be added to breaking change docs regardless. There's now _another_ change proposed in #21417 that's breaking _not_ in a major version change (bad). That's an issue in itself, and just bad behavior for any project.

The documentation fix is appreciated, but what are we fixing here? The strict parameter parsing is good, but would adding this as a recognized parameter and maintaining behavior be a bad thing? Then when a monitoring system hits these routes, it's not version switched per version. When things like this break, I'm not only fixing our monitoring system (both Bosun and Opserver), everyone else is also doing this. And that's bad, because this break wasn't really necessary.

No one's asking that we revert strict parameter parsing, the ask if much smaller in scope: recognize this parameter. It was allowed and documented to be allowed past the 5.x release (the live docs still have it).

More globally, the monitoring APIs in general are the thing most likely to need many-version support at once. Any changes that aren't _really_ necessary here are painful, multiplied by the number of systems hitting them. As a user, what's the benefit here? How many lines of code would it take in elasticsearch to make this work? The current version would need to _just ignore it_. Is that a big ask? What's happening right now is we've gone from a route that worked fine to throwing a 400, for...what? How much code did we save? Was it worth the pain caused to users?

This is broken. There's no fix for it. Users are on 5.0.0 and I have to support them. The ship has sailed and either monitoring systems _specifically_ don't support this exact version (if the behavior did go back), or we already have to have this fork in our code. For probably a decade.

But what's the bigger issue here? Breaking changes are proposed for a 5.1.0 or (even worse) a 5.0.x in #21417 is very bad. These should absolutely not be allowed. If I have no idea if my stuff will break in the most minor of releases, then I won't upgrade. That's bad for everyone. There are really good reasons that semantic versioning exists and this is a primary example. Please, don't let this happen. Elastic has to take breaking changes seriously if they want people to be on a version anywhere near current. If users can't trust an upgrade, then they won't. Things like this (if deployed) destroy such trust, instead of building it.

I honestly have no idea if Elastic (in general) strives to maintain semantic versioning, but even if that's not the case, can we at least get on the same page with breaking changes? As a developer and a sysadmin, I'm telling you: this is _extremely_ important.

NickCraver on 9 Nov 2016

I also backported the docs fixes to 2.4 via a025c9c2eb1564033eb148a8d157ae70849b108c and 1.7 via 6d95de2cdaca419322a8a60a200b8a7dbbd12bf8.

jasontedor on 9 Nov 2016

@NickCraver I think we crossed responses, based on my understanding, 5.0.0 is not a unique version from a code perspective, we just missed docs. If it was a unique version from a code behavior perspective, I 1000% agree with you. But it is not. From 5.0.0 onwards, ?all will get rejected.

kimchy on 9 Nov 2016

@kimchy Are you only seeing one of the two issues here? #21417 is not about ?all. It's _another_ breaking change, pitched for both a minor or even a "bugfix" release.

The ?all is unsolvable in itself (as mentioned above) because 5.0.0 shipped at all. I still disagree that, given the nature of this API set, it should have been broken at all. That's more of a discussion on breaking monitoring APIs in general - which should happen. Which APIs should be extremely long-term stable? (Without _very_ compelling reasons to break them)

NickCraver on 9 Nov 2016

👍2

The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.

ejsmith on 9 Nov 2016

Hi @NickCraver

There is a tension here between forcing existing users to adjust to breaking changes and saving new users from hours of debugging when the parameters that they try just don't work. We have had so many issues opened because our APIs silently ignored unknown parameters. We have largely fixed this - we now tell you when you're doing something wrong, and we provide suggestions for which parameters you probably wanted to use. This is a huge step forward in usability.

The all flag was removed in 1.0, along with the clear and various metric flags. The bug was that we didn't remove all the documentation about the all flag, but it hasn't worked since 1.0. I understand your point about how ignoring the all flag silently in 5.0 wouldn't make a difference, but why is the all flag different from the other flags? Maybe you are only using the all flag but there may be others using the other flags, which also haven't worked since 1.0. If we silently ignore those too, we leave users guessing about why things aren't working the way they assume instead of telling them immediately what the problem is.

Regarding the strict checking of metrics in #21417. We have formally embraced strict REST parsing in 5.0 - shouting loud and shouting early saves users so much time and heartache, the benefits are clear. If we find APIs that still accept unknown parameters, it is considered a bug which should be fixed, not a breaking change.

If we continue silently ignoring unknown metrics, we're punishing all of those users who don't understand why their requests aren't working correctly. We can add backwards compatibility and deprecation logging for the percolator metric, which is no longer available, but how does it help users to ignore unknown metrics? Most users are probably unaware that they have a problem and would be grateful for being told what the problem is. While we could delay this change for 6.0, it would mean another year where users are left in the dark.

The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.

@ejsmith We're trying hard really hard to balance these two camps: not hurting existing users and making things better for new users. This is not an easy choice. We don't make breaking changes lightly and when we do, it is with the intention of moving to more stable APIs. The breaking changes between 1.x and 2.0 were hard for everybody, 2.x to 5.0 should be a much easier experience. We don't always get it right but please be aware that a lot of soul searching goes into these decisions.

clintongormley on 9 Nov 2016

it is considered a bug which should be fixed, not a breaking change

These are orthogonal issues, and not at all mutually exclusive. Any change can be both, as is the example here. That doesn't change the fact that breaking changes should be reserved for major versions. Yet repeatedly, with this project, they are not. This contributes greatly to people not upgrading. It's one of the reasons we don't upgrade often here.

This is not an easy choice. We don't make breaking changes lightly

I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it. I really wish I had the tab open with a screenshot (maybe someone else still does). If you want to carefully consider changes and what goes into things that's fine, but saying that happened here is bull. The fact that comment was edited shows it's recognized.

Elasticsearch is your project. Do what you want with it. But don't assume this has zero impact. If I have to weigh heavily testing every release for breaks because Elastic doesn't care about semantic versioning and is therefore shifting the burden to me (and every other user), then it's worth it one again to spend equal time evaluating other options all together. Would we move to Solr here? I have no idea, but it's absolutely worth investigating again, because the time factor went from low and known to completely unknown and significant. When we upgrade Stack Overflow to a new bugfix release, will it break production? Because you're "helping" me with these breaking changes? Well now we have no confidence in that.

The point of versioning and semantic versioning is that, as a user, I know what I am getting. If it's a x.0.0 upgrade, I know there will be breaks. If there are breaks hiding in build or minor releases, then I a) have to test far more heavily for every change, and b) now expect surprises when we deploy production. That's unhealthy, at best. Semantic versioning exists for extremely good reasons. I'm really not seeing that anyone at Elastic gets that in this issue. Instead, the comments and actions here indicate the exact opposite.

NickCraver on 9 Nov 2016

👍2

I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it.

I agree that was a poorly considered comment which doesn't reflect the thought process here. Hence the edit.

And your comments are appreciated and heard. They have, for instance, made me reconsider my decision to backport a change to the CAT api to 5.1 - it's now been reverted and will be in 6.0 only.

clintongormley on 9 Nov 2016

I get something like that.