Elasticsearch: Creating a snapshot fails at the very last stage with NullPointerException

Created on 22 Apr 2018  路  9Comments  路  Source: elastic/elasticsearch

Hi,

Creating a snapshot of a single index reports it finished correctly through _snapshot/_status ("state": "SUCCESS"), but a NullPointerException is thrown and the snapshot eventually does not show up in the repository.

Elasticsearch version:
6.2.3

Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3

JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
We've recently upgraded our elasticsearch cluster from v5.2.2 to v6.2.3, running both clusters in parallel and used the snapshots feature on Azure blob storage to replicate the data to the new cluster. After migration, new snapshot creation tasks are executed daily for backup from the new v6.2.3 cluster onto the same repositories. Initiating the create snapshots succeeds, taking the expected amount of time to copy the data to Azure and even reporting "state": "SUCCESS" on _snapshot/_status:

{
"snapshots": [
{
"snapshot": "2_newlogs_20180328",
"repository": "2_newlogs",
"uuid": "ds3cBPQOSpqIOt53-ImKBw",
"state": "SUCCESS",
"include_global_state": false,
"shards_stats": {
"initializing": 0,
"started": 0,
"finalizing": 0,
"done": 2,
"failed": 0,
"total": 2
},
"stats": {
"number_of_files": 147,
"processed_files": 147,
"total_size_in_bytes": 12541757401,
"processed_size_in_bytes": 12541757401,
"start_time_in_millis": 1524411697845,
"time_in_millis": 462463
},
"indices": {
"2_newlogs_20180328-01": {
"shards_stats": {
"initializing": 0,
"started": 0,
"finalizing": 0,
"done": 2,
"failed": 0,
"total": 2
},
"stats": {
"number_of_files": 147,
"processed_files": 147,
"total_size_in_bytes": 12541757401,
"processed_size_in_bytes": 12541757401,
"start_time_in_millis": 1524411697845,
"time_in_millis": 462463
},
"shards": {
"0": {
"stage": "DONE",
"stats": {
"number_of_files": 63,
"processed_files": 63,
"total_size_in_bytes": 6270745558,
"processed_size_in_bytes": 6270745558,
"start_time_in_millis": 1524411697845,
"time_in_millis": 439319
}
},
"1": {
"stage": "DONE",
"stats": {
"number_of_files": 84,
"processed_files": 84,
"total_size_in_bytes": 6271011843,
"processed_size_in_bytes": 6271011843,
"start_time_in_millis": 1524411697863,
"time_in_millis": 462445
}
}
}
}
}
}
]
}

After the create snapshot finishes (_snapshot/_status returns an empty array), the snapshot does not show up in the repository and a NullPointerException is thrown in the master node logs:

[2018-04-22T15:49:20,567][WARN ][o.e.s.SnapshotsService ] [prod-elasticsearch-master-003] [2_newlogs:2_newlogs_20180328/ds3cBPQOSpqIOt53-ImKBw] failed to finalize snapshot
java.lang.NullPointerException: null
at org.elasticsearch.repositories.RepositoryData.snapshotsToXContent(RepositoryData.java:341) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:674) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.finalizeSnapshot(BlobStoreRepository.java:470) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.snapshots.SnapshotsService.lambda$endSnapshot$3(SnapshotsService.java:974) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

This has been replicated with multiple create snapshots operations on multiple repositories.
Additionally, deleting the repository (without individually deleting the snapshots it contains) and re-creating it (thereby loading the available snapshots from 5.2.2) did not solve the problem.

I'd be happy to provide any additional information as needed.

Thanks!

:DistributeSnapshoRestore >bug

Most helpful comment

Thanks @redlus, the description of your problem helped me to find a reproducible scenario where such NPE is thrown. See #30140 for some boring information :)

All 9 comments

Thanks @redlus. It looks similar to other issue we saw. I'm going to look closer at this.

Thanks. Is there an open issue for what you mentioned?

Pinging @elastic/es-distributed

@tlrx a quick update:
I've deleted all the repositories from elasticsearch, and created new (empty) repositories with the same name but pointing to a different container name within the Azure blob storage account. Snapshots are created, restored and deleted without errors.

This seems to indicate that repositories and snapshots created with elasticsearch 5.2.2 have some collision with create / restore / delete snapshot commands run on the same repos from elasticsearch 6.2.3.

This seems to indicate that repositories and snapshots created with elasticsearch 5.2.2 have some collision with create / restore / delete snapshot commands run on the same repos from elasticsearch 6.2.3.

Just to double check, does it mean that two clusters in different versions where writing to the same repository (Azure container) at the same time?

There were two phases. First, v5.2.2 successfully created snapshots and v6.2.3 successfully restored snapshots. Then the old v5.2.2 cluster was taken down, and v6.2.3 was now trying to create snapshots onto the same repo - resulting in the NullPointerException above.

This seems to indicate that repositories and snapshots created with elasticsearch 5.2.2 have some collision with create / restore / delete snapshot commands run on the same repos from elasticsearch 6.2.3.

Thanks for your responses.

In your case, you snapshotted data from 5.2.2 and restored them on 6.2.3. We usually advise to upgrade to the latest 5.x version first and then move to the next major version (see https://www.elastic.co/guide/en/elasticsearch/reference/6.2/setup-upgrade.html).

Also, according to the the documentation:

The snapshot format can change across major versions, so if you have clusters on different major versions trying to write the same repository, new snapshots written by one version will not be visible to the other. While setting the repository to readonly on all but one of the clusters should work with multiple clusters differing by one major version, it is not a supported configuration.

It means that the repository created using the 5.2.2 cluster must only be accessed by this 5.2.2 cluster for creating or deleting snapshots.

The cluster 6.2.3 can register the 5.2.2 repository too but it must be read-only. This way the 6.2.3 cluster can restore data from 5.2.2 but cannot create or delete snapshots.

If you want to create snapshots of the 6.2.3 cluster then you have to register a new repository in a different container/bucket/folder, as you did.

Is there an open issue for what you mentioned?

Finally found it :) See #29052 where a user had a similar issue than yours and the pull request #26127 that fixed the issue #25878.

I'm going to close this issue in favor of #29052.

Ok, thanks for the update.

Thanks @redlus, the description of your problem helped me to find a reproducible scenario where such NPE is thrown. See #30140 for some boring information :)

Was this page helpful?
0 / 5 - 0 ratings