Elasticsearch: ILM resiliency to deleting rollover alias

Created on 30 Apr 2019 · 17Comments · Source: elastic/elasticsearch

ILM allows for simple index lifecycle management. In 7.0 ILM is the default for all Beats/Logstash and out of the box supports a basic roll over policy. While it is nice that in 7.0 everything just works out of the box for Beats and Logstash, it also hides some of implementation details that are required for the rollover policy to work. Specifically the bootstrap requirements for ILM requires setting up of the policy, index template, and alias. Beats kindly does all of this for the user without their knowledge.

However as a user, it is quite easy to not be aware of the setup work that Beats and Logstash performs to ensure that ILM works properly. This makes it quite easy for a user (especially during the getting started phase) to accidentally delete the write alias with a simple DELETE mybeat*. This, I assume, is a common way to delete data while you are getting things setup. Deleting the write alias, for Beats/Logstash AND leaving the beats running will result in the creation of a concrete index with the alias name without any ILM policy attached. For user un-aware of the details of ILM setup requirements this can appear as a bug that ILM no longer works.

Here are reasonable steps by a user:

Install and run a Beat
Check Elasticsearch data (not quite right)
Tweak the Beat configuration (while running and/or restart)
Delete the old beat data via DELETE mybeat*
Observer/query the data against mybeat-7.0.0

The problem above is that data is being written to a concrete index mybeat-7.0.0, not an alias ! mybeat-7.0.0 does not match the index pattern defined in the template (mybeat-7.0.0-*) so the ILM policy is not applied.

While the removal of the ILM is due to a user's action, those actions are reasonable. It should not be expected that user understands that Beats by default will setup the write alias and writes through that alias and if you delete data you need to ensure that you only delete the concrete indexes behind the alias.

One idea proposed by @andrewvc would be create a new option for indexing that requires writing through an alias, else fail with a meaningful message. This would allow Beats/Logstash (with code changes) an opportunity to recreate the alias to avoid creating a concrete index as well as check the other bootstrap requirements still exist.

I believe this is also tangentially related to https://github.com/elastic/elasticsearch/issues/37880 and https://github.com/elastic/elasticsearch/issues/35211

:CorFeatureILM+SLM CorFeatures

Source

jakelandis

❤12 👍4

Most helpful comment

Just FYI on this thread, here's an example of how to recover the lost alias in an atomic operation (data written to the accidental concrete index will be lost):

PUT metricbeat-7.5.1-000001
POST /_aliases
{
    "actions" : [
        { "add":  { "index": "metricbeat-7.5.1-000001", "alias": "metricbeat-7.5.1" } },
        { "remove_index": { "index": "metricbeat-7.5.1" } }  
    ]
}

PhaedrusTheGreek on 27 Jan 2020

❤1 👍1

All 17 comments

Pinging @elastic/es-core-features

elasticmachine on 30 Apr 2019

Here's what I wish we had, to make this easier for everyone (not just beats):

// A policy with rollover that will require an alias
PUT _ilm/policy/roll
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "10d"
          }
        }
      }
    }
  }
}

PUT _template/logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.lifecycle.name": "roll",
    "index.lifecycle.rollover_alias": "roll-alias"
  },
  "aliases": {
    "roll-alias":{
      "is_write_index": true,
      // This setting
      "only_create_on_first_index": true
    }
  }
}

What only_create_on_first_index would do is only apply this alias
creation/addition if this is the very first index to be created. If there is
already an index that matches this pattern, then the roll-alias section is
ignored.

I'm not sure whether it's technically feasible to implement it like this, but
this would allow us not to have to have a special "bootstrapping" for the first
index which seems to trip a lot of people up with ILM.

dakrone on 30 Apr 2019

@dakrone I like that proposal, but unless I misunderstand I don't think it solves the problem we see on beats. We want our clients to fail when writing to, say heartbeat-8.0.0, which is supposed to be an alias, if the index has been created.

The client needs to be a able to say "I think I'm writing to an ILM alias, if that is not the case, please fail the write and let me know so I can fix the ILM alias".

andrewvc on 30 Apr 2019

@bleskes and I chatted about adding an option to prevent auto-creating an index if the index/alias name was missing. eg

PUT foo/_doc?auto_create_index=false
{...}

This would then throw an exception if the foo index or alias doesn't exist.

clintongormley on 6 May 2019

👍1

That solution is the best one here so far. Would be curious what @urso and the rest of @elastic/beats-core thinks.

The beats code would wind up something like:

Attempt bulk index
If items fail due to no autocreate
Retry ILM setup
If that fails enter a failed state

Should that be standard guidance for users?

andrewvc on 6 May 2019

Thinking about it more is there a way we could autocreate the alias if it's deleted? This is a lot of logic to push onto the client, and before we know it every client will implement the same boilerplate or be buggy.

What if the template could apply to the autocreate logic internal to ES? Then things would work like they did historically and users would have a simpler experience.

andrewvc on 6 May 2019

Thinking about it more is there a way we could autocreate the alias if it's deleted? This is a lot of logic to push onto the client, and before we know it every client will implement the same boilerplate or be buggy.

No, let's not push any complicated logic into the client. We're looking at a world where setup happens in Fleet in Kibana. If the user deletes the alias, then beats should stop sending and throw errors. The user would have to set up the indices again via Fleet.

clintongormley on 6 May 2019

auto_create_index=false would do the trick to some extend. We would detect if ILM is available and setup by Beats on startup and only use this parameter if that's the case.

There is still the chance of having conflicts, though. Beats uses the bulk API. If we have this as HTTP parameter, then it must be true for every index in a bulk request. In case of Filebeat the index name is often generated based on event contents (from JSON files). This would force us to create separate batch types (managed, vs. unmanaged batches) or
Beats always have to ensure an index exists before publishing to it.

In general settings via HTTP parameters can be annoying, and we already have to jump through some hoops for sending separate bulk requests to xpack monitoring. Current workaround in Beats is to just create two pipeline instances for grouping queuing-batching-retry-indexing logic by potential HTTP parameters...

urso on 10 May 2019

@jakelandis have any plans been made in terms of scheduling this auto_create flag? We still see this somewhat regularly in the wild.

andrewvc on 3 Jul 2019

@andrewvc we haven't started on it, but we're currently going through the backlog and prioritizing things for working on, this is one of the issues included in that backlog.

dakrone on 3 Jul 2019

Is there any resolution to this ? I followed this from a google search -> forum -> to Issue https://github.com/elastic/beats/issues/11940 which referenced this one.

Summary : when viewing "Logs" :
```An error occurred
Failed to load data sources.

Try again
Error: GraphQL error: [illegal_argument_exception] Fielddata is disabled on
text fields by default. Set fielddata=true on [event.dataset] in order to load
fielddata in memory by uninverting the inverted index. Note that this can
however use significant memory. Alternatively use a keyword field instead.
```
I disagree with the sentiment that this is bad a user experience - as a user/consumer (and engineer as well) - I classify this as a BUG and REGRESSION from 7.4.x. The setup behavior changed, and there is no documentation with how to configure it. OR how the heck to I fix this? In fact, the getting-started guides have NOTHING referencing requiring ILM policies to get things initially up and running.

This is a clean, fresh install of ELK ... Elastic Search <- Logstash <- filebeats + kibana install of 7.5.1.

I did not have this issue when testing with 7.4.x. I DID NOT DELETE ANY INDEXES. The use-case/user-story of the issue documented is incorrect.

Right now ... I'm not looking for a patch from eng, but the documentation or the correct procedure to fix this error. (See screen shot as well.)

Screen Shot 2019-12-29 at 10 06 31 PM

care2DavidDPD on 30 Dec 2019

👍1

Is there any resolution to this ? I followed this from a google search -> forum -> to Issue elastic/beats#11940 which referenced this one.

Though, elastic/beats#11940 was referenced here, I might be mixing issues. This still seems like a regression (I don't remember doing this with a clean 7.4.x install.) Either way, in kibana -> management -> Elasticsearch, there should be a push button to fix this.

And curious, Filebeat is actually sending to Logstash, and logstash to ES (mainly because timestamp needed to be the log time, not the received time, which didn't seem possible with Filebeat).

https://gist.github.com/care2DavidDPD/6f0a211d427aa8f6798c901278f738f5

care2DavidDPD on 30 Dec 2019

@care2DavidDPD While there's clearly a problem here, I think what you're hitting is different than the problems described in this issue.

The problems that occur when the alias is deleted have to do with ILM not managing indices as expected when the alias is deleted, particularly rollover, which can result in indices growing very large.

The error message in your post/screenshot is one that's returned by Elasticsearch when a query is made that Elasticsearch isn't able to process efficiently due to the mapping of the index. That might mean there's a problem with the query Kibana is making, or it might mean there's an issue where the index mapping created by Filebeat isn't correct. Unfortunately, there's not enough information in your post to determine which of those it is - you'll likely have to collect some logs to diagnose this, but we'd like to keep this issue focused on discussion of the usability bug described in the first post.

My recommendation would be to make a post on the Beats forums or the Kibana forums (feel free to link this comment), where someone should be able to help you find a solution. Once we have a better understanding of what happened and whether it's more related to Kibana or Beats, then we should create a new issue on the appropriate repo for any changes necessary to fix this going forward, whether that ends up being code changes, documentation, whatever.

gwbrown on 6 Jan 2020

We fixed this through Twitter and it's indeed a different issue. We can focus on the original ILM topic here :)

xeraa on 6 Jan 2020

Just FYI on this thread, here's an example of how to recover the lost alias in an atomic operation (data written to the accidental concrete index will be lost):

PUT metricbeat-7.5.1-000001
POST /_aliases
{
    "actions" : [
        { "add":  { "index": "metricbeat-7.5.1-000001", "alias": "metricbeat-7.5.1" } },
        { "remove_index": { "index": "metricbeat-7.5.1" } }  
    ]
}

PhaedrusTheGreek on 27 Jan 2020

❤1 👍1

Just FYI on this thread, here's an example of how to recover the lost alias in an atomic operation (data written to the accidental concrete index will be lost):
PUT metricbeat-7.5.1-000001
POST /_aliases
{
    "actions" : [
        { "add":  { "index": "metricbeat-7.5.1-000001", "alias": "metricbeat-7.5.1" } },
        { "remove_index": { "index": "metricbeat-7.5.1" } }  
    ]
}

This is gold. A few times I've had an issue where many client beats (journalbeat, filebeat etc) are running and are writing to what should be an alias, but the alias was deleted and now they're writing to the concrete index. Cue me frantically attempting to delete said index, and create the alias before my hundreds of clients have had a chance to write to the concrete index yet again.