Elasticsearch: Introduce a new `_last_modified` metafield

Created on 11 Oct 2016 · 15 Comments · Source: elastic/elasticsearch

I had some discussions with @niemyjski and @ejsmith on #17895 about _timestamp, its guarantees and semantics, and how it was used versus its purpose. It became obvious to me that the semantics users expect are those of a last_modified timestamp that is updated every time the document is indexed on the primary shard. This implies that the value of this field is updated:

  • when reindexed via _reindex
  • when updated (with script or not)
  • when indexed via the API

The field should be very clear in its semantics, such that it can NOT be provided or modified from the outside. Yet, what is still somewhat unclear to me is if it should be indexed, stored and / or have docvalues. I think when we add this we want the configuration options to be as small as possible, maybe only enabled: true|false. For this to happen we need to know the use cases, and I was hoping @niemyjski and @ejsmith could help design this new field with clear semantics.

:CorInfrCore >enhancement v6.0.3

All 15 comments

Yet, what is still somewhat unclear to me is if it should be indexed, stored and / or have docvalues.

To expand this for those not super familiar with all the terms and their implications:

If the field is not indexed then you can't search for it using range or match or something like that. You'd still be able to search for it using a script query if it has docvalues but that'd be super slow unless paired with a super selective query.

If the field has doc values then you can aggregate on it or use it in a script query or a function_score query or a script field as doc['_last_modified']. You _could_ also get it back in the response as well if we implemented that.

If the field is stored then you can get it back in the response. If the field doesn't have docvalues and isn't stored then you can't ever get it back in the response which is a bit silly for this kind of field. If the field already has doc values then whether or not to also store it is a disk seeks vs storage efficiency tradeoff.
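To make the trade-offs above concrete, here is a small Python sketch that models the three options as flags and reports what each combination would allow, following the description in this comment. `FieldOptions` and `capabilities` are hypothetical names for illustration only, not Elasticsearch APIs; the flag names simply mirror the `index`, `doc_values`, and `store` mapping parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldOptions:
    """Hypothetical model of the three knobs discussed above."""
    index: bool
    doc_values: bool
    store: bool


def capabilities(opts: FieldOptions) -> dict:
    """Summarize what a configuration permits, per the comment above."""
    return {
        # range/match style queries need the field to be indexed
        "fast_search": opts.index,
        # a script query can use doc values instead, but it is slow
        "slow_script_search": opts.doc_values,
        # aggregations and doc['_last_modified'] access need doc values
        "aggregate_or_script": opts.doc_values,
        # returned in the response if stored (or from doc values,
        # assuming that were implemented)
        "retrievable": opts.store or opts.doc_values,
    }


print(capabilities(FieldOptions(index=True, doc_values=True, store=False)))
```

With neither `doc_values` nor `store` set, `retrievable` comes out false, which is the "a bit silly" case mentioned above.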

The primary use case is for doing things like online reindex and bulk operations. It's unrealistic for us to take our cluster offline while we do a 100gb reindex. So we do this:

  • Alias that the app uses to read and write data
  • Reindex in the background to a new index
  • 2nd pass of reindex where the timestamp is greater than the date that we started the initial reindex. This gets us pretty close to being real-time
  • Swap the alias from the old index to the new one
  • Do one more reindex pass for any documents that may have been modified in the time it took us to swap aliases and do the 2nd pass reindex

So you can see that the timestamp field is really important to this strategy. Also, we use it for other bulk operations as well.
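The steps above can be sketched as request bodies for the `_reindex` and `_aliases` APIs. This is a minimal sketch, not the commenter's actual code: the index names, the alias name, the timestamps, and the `updated_at` field (a user-maintained date field standing in for the timestamp) are all assumptions, and the functions only build the JSON bodies without sending anything to a cluster.

```python
import json


def reindex_body(source, dest, modified_since=None, field="updated_at"):
    """Body for POST _reindex; optionally limited to docs modified since a time."""
    src = {"index": source}
    if modified_since is not None:
        # `field` is an assumed, user-maintained date field
        src["query"] = {"range": {field: {"gte": modified_since}}}
    return {"source": src, "dest": {"index": dest}}


def swap_alias_body(alias, old_index, new_index):
    """Body for POST _aliases that moves the alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical timestamps recorded by the client when each pass begins.
first_pass_started = "2016-10-11T00:00:00Z"
second_pass_started = "2016-10-11T06:00:00Z"

steps = [
    reindex_body("docs_v1", "docs_v2"),                                   # full pass
    reindex_body("docs_v1", "docs_v2", modified_since=first_pass_started),  # catch-up
    swap_alias_body("docs", "docs_v1", "docs_v2"),                        # switch over
    reindex_body("docs_v1", "docs_v2", modified_since=second_pass_started),  # stragglers
]

for body in steps:
    print(json.dumps(body))
```

Every catch-up pass depends on the range filter, which is why the approach only works if the date field is reliably updated on each write.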

To me _timestamp is the correct name for this feature. "_last_modified" sounds like it's only set when the document is modified, but it should be set when new documents are written as well. I think the core issue with the current _timestamp is that you let it be mapped to the document and modified. I understand that you don't want to break existing people, but you are already breaking them by removing the _timestamp feature. So couldn't we just modify that feature instead to do the right thing?

@ejsmith IMO what you want is something completely different. You want something like a sequence ID that you can resume from to get changes for the reindex. I am really trying to help you with your requirements, but for these problems you don't need a timestamp field. I guess all you need is some kind of "generation" field that you can add as an ordinary numeric field and increment on the client after you start reindexing. Bear with me, I am trying to understand your use case.
As a side-note I wonder how you realize that stuff gets deleted?
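One way to read the "generation" suggestion is sketched below in Python. Everything here is hypothetical: `GenerationWriter` is an in-memory stand-in for a client that stamps each write with a counter, and the `generation` field name is an assumption. It only illustrates the idea of bumping the counter when a reindex starts and then selecting documents written at or after that generation.

```python
class GenerationWriter:
    """Hypothetical client-side helper that tags each write with a generation."""

    def __init__(self):
        self.generation = 0
        self.index = {}  # doc id -> document body; stands in for the real index

    def start_reindex(self):
        """Bump the generation and return it as the resume point."""
        self.generation += 1
        return self.generation

    def write(self, doc_id, body):
        """Index or update a document, stamping the current generation."""
        self.index[doc_id] = {**body, "generation": self.generation}

    def changed_since(self, generation):
        """Ids of documents written at or after the given generation."""
        return sorted(
            doc_id
            for doc_id, doc in self.index.items()
            if doc["generation"] >= generation
        )


writer = GenerationWriter()
writer.write("a", {"title": "first"})
writer.write("b", {"title": "second"})
resume_from = writer.start_reindex()
writer.write("b", {"title": "second, edited"})
print(writer.changed_since(resume_from))
```

As with a timestamp, a hard delete leaves nothing behind to select on, which is the point of the side note about deletes.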

To me _timestamp is the correct name for this feature. "_last_modified" sounds like it's only set when the document is modified, but it should be set when new documents are written as well.

_timestamp is user provided 90% of the time (ES does NOT produce it and it's NOT consistent about it); that is why we removed support for it. It's the time the event occurred, not when it was indexed or anything like that. We cannot just go and change the semantics of something that has been around forever. That would be dramatic, I think, since it would create even more confusion.

This issue about _last_modified is to create a consistent timestamp that is set every time a document is created or updated. Users can NOT provide this from the outside (a major difference, including in guarantees).

So couldn't we just modify that feature instead to do the right thing?

I think you don't understand the impact of this. There is no right thing here; _timestamp was used for so many things that changing its behavior is basically asking for trouble. If you remove something like this and you are maintaining such a codebase, you want your users to notice if they still use it. If you are relying on people reading migration guides, you are doomed. We have no other choice than deprecating and removing. I don't necessarily understand why you are so resistant here. We are basically putting a large effort into helping your use case, so why do you keep pushing along those lines?

I think you are assuming that my documents are immutable. They are not, they can be modified. As far as deletes go, I use soft deletes and then have a job that goes and really deletes them later. So I do not run that job while doing a reindex.

They would notice that you changed it just the same way that they would notice that you removed the feature. If you make it readonly and unmappable their app would break since the mapping isn't allowed any more. So it's a breaking change that they would have to accommodate either way. I really don't care that much what the name is, but that is the same name used in just about every other db.

Personally I think _last_modified is a much more clear name than _timestamp. When I see _timestamp I always think "timestamp of what".

that is the same name used in just about every other db

I believe MySQL's timestamp defaults to now() and automatically updates to now every time you touch the row. I admit I don't remember it too well. I'm fairly sure that PostgreSQL, Oracle, and MongoDB don't default timestamps to now(). You can get them to default to now or do the update-on-touch thing with some configuration, I think. I'm sure there are other systems, but I don't think it is worth picking a worse name to align with MySQL and other systems that behave like it.

Also I think in cases where we don't do _exactly_ the same thing it is a mistake to use some other system's feature as a namesake. I'm fairly sure we don't want to commit to aligning with MySQL's timestamp functionality though I admit I don't know the extent of it.

I don't care about the name too much. :-) I just want a reliable way of retrieving documents that have been modified after a date.

I just want a reliable way of retrieving documents that have been modified after a date.

Fair enough. I think it's pretty clear that we should index _last_modified. It is useful for slicing things up like you said and for limiting search results.

My instinct is that it should also have docvalues so it is sortable but that is just an instinct.

I really don't care that much what the name is, but that is the same name used in just about every other db.

a friendly reminder, es is not a database

If you make it readonly and unmappable their app would break since the mapping isn't allowed any more.

This doesn't make sense. You're using soft deletes, so you modify the document to set deleted: true which means that the _last_modified date is updated. All works exactly as you need.

As far as naming goes, _timestamp is the wrong name. It could mean:

  • when was the document created
  • when was the document modified
  • when was the original event that resulted in document creation
  • when has a particular type of event occurred

On the other hand, _last_modified is very explicit about its meaning. Of course document creation counts as a modification. Why wouldn't it?

(Also note that your use of soft deletes puts you into a tiny minority of users, yet you want this badly named _timestamp field to work in exactly the way you want for your use case. Other users have the same expectation)

Simon is right in saying that the feature that you actually need for this style of reindexing is sequence IDs, which is a feature that we're working on but which won't land in 5.x. A _last_modified field is a reasonable (but not great) proxy for that feature. However I think adding a _last_modified field will be useful even once we have sequence IDs, so I'm +1 on doing it.

Like I said, I am fine with the name. I'm not interested in adding anything that is specific to my use case. I stated my problem and I think it's a pretty common thing to want to do online reindexing.

I'm not sure what you guys mean by sequence id. Is that the ability to resume scan and scroll or something?

I'm not sure what you guys mean by sequence id. Is that the ability to resume scan and scroll or something?

Every operation is assigned a sequence ID so that you can replay operations from a certain point onwards. See https://github.com/elastic/elasticsearch/issues/10708

It will enable us to implement a changes API (https://github.com/elastic/elasticsearch/issues/1242)

Yes, that would be awesome.

It seems that sequence IDs and the changes API would solve the use cases proposed here indeed. I am closing this issue pending additional use cases that would compel us to consider adding a _last_modified field.

@jasontedor Do you have an example of how one would query the changes API to see all changes since x time?

@niemyjski The changes API is not available today (#1242). You would not query for changes since a given time, but rather all changes above a certain sequence number.
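Since the changes API does not exist yet, the following Python sketch is purely conceptual and says nothing about how it will actually look. `OperationLog` is a hypothetical in-memory model of the idea described above: every operation gets the next sequence number, and a consumer asks for everything above the last number it has processed rather than for a point in time.

```python
class OperationLog:
    """Hypothetical in-memory model of sequence-numbered operations."""

    def __init__(self):
        self.ops = []  # list of (seq_no, op_type, doc_id) tuples

    def record(self, op_type, doc_id):
        """Append an operation and return the sequence number assigned to it."""
        seq_no = len(self.ops)
        self.ops.append((seq_no, op_type, doc_id))
        return seq_no

    def changes_above(self, seq_no):
        """All operations with a sequence number strictly above seq_no."""
        return [op for op in self.ops if op[0] > seq_no]


log = OperationLog()
log.record("index", "a")
checkpoint = log.record("index", "b")  # last operation the consumer has seen
log.record("index", "a")               # later update
log.record("delete", "b")              # deletes are operations too

for op in log.changes_above(checkpoint):
    print(op)
```

Unlike a timestamp filter, a log like this also surfaces deletes, because a delete is just another numbered operation.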
