Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_121
Plugins installed:
JVM version (java -version):
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux HOSTNAME 3.0.101-0.113.TDC.1.R.0-default #1 SMP Fri Dec 9 04:51:20 PST 2016 (ca32437) x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Listing snapshots for the repository-hdfs plugin snapshot repository should work, but instead there is a java.lang.SecurityException: access denied ("java.lang.reflect.ReflectPermission" "suppressAccessChecks") from the JVM security manager.
Steps to reproduce:
curl -i -u USERNAME -H 'Accept: application/json' -H 'Content-Type: application/json' "https://$(hostname -f):9200/_snapshot/CLUSTERNAME/_all?pretty"
Provide logs (if relevant):
Stacktrace from missing security policy permission
...
# first call to curl -i -u USERNAME -H 'Accept: application/json' -H 'Content-Type: application/json' "https://$(hostname -f):9200/_snapshot/CLUSTERNAME/_all?pretty"
org.elasticsearch.transport.RemoteTransportException: [master-HOSTNAME][IP_ADDRESS:9301][cluster:admin/snapshot/get]
Caused by: java.lang.SecurityException: access denied ("java.lang.reflect.ReflectPermission" "suppressAccessChecks")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_121]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_121]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_121]
at java.lang.reflect.AccessibleObject.setAccessible(AccessibleObject.java:128) ~[?:1.8.0_121]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:396) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:335) ~[?:?]
at com.sun.proxy.$Proxy34.getServerDefaults(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:640) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.shouldEncryptData(DFSClient.java:1755) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.newDataEncryptionKey(DFSClient.java:1761) ~[?:?]
at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:210) ~[?:?]
at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160) ~[?:?]
at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:581) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2933) ~[?:?]
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) ~[?:?]
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) ~[?:?]
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) ~[?:?]
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:706) ~[?:?]
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:647) ~[?:?]
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:918) ~[?:?]
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:974) ~[?:?]
at java.io.DataInputStream.read(DataInputStream.java:100) ~[?:1.8.0_121]
at org.elasticsearch.common.io.Streams.copy(Streams.java:79) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.common.io.Streams.copy(Streams.java:60) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:762) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.snapshots.SnapshotsService.getRepositoryData(SnapshotsService.java:140) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:97) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:55) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:166) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.2.jar:5.5.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
...
# second call to curl -i -u USERNAME -H 'Accept: application/json' -H 'Content-Type: application/json' "https://$(hostname -f):9200/_snapshot/CLUSTERNAME/_all?pretty"
[2017-09-01T14:08:10,615][WARN ][r.suppressed ] path: /_snapshot/CLUSTERNAME/_all, params: {pretty=, repository=CLUSTERNAME, snapshot=_all}
org.elasticsearch.transport.RemoteTransportException: [master-HOSTNAME][IP_ADDRESS:9301][cluster:admin/snapshot/get]
Caused by: java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129) ~[?:?]
at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:116) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:160) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:335) ~[?:?]
at com.sun.proxy.$Proxy34.getListing(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1681) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1665) ~[?:?]
at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:257) ~[?:?]
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1806) ~[?:?]
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1802) ~[?:?]
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) ~[?:?]
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1808) ~[?:?]
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1767) ~[?:?]
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1726) ~[?:?]
at org.elasticsearch.repositories.hdfs.HdfsBlobContainer$6.run(HdfsBlobContainer.java:145) ~[?:?]
at org.elasticsearch.repositories.hdfs.HdfsBlobContainer$6.run(HdfsBlobContainer.java:142) ~[?:?]
at org.elasticsearch.repositories.hdfs.HdfsBlobStore.lambda$execute$0(HdfsBlobStore.java:132) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_121]
at java.security.AccessController.doPrivileged(AccessController.java:713) ~[?:1.8.0_121]
at org.elasticsearch.repositories.hdfs.HdfsBlobStore.execute(HdfsBlobStore.java:129) ~[?:?]
at org.elasticsearch.repositories.hdfs.HdfsBlobContainer.listBlobsByPrefix(HdfsBlobContainer.java:142) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.listBlobsToGetLatestIndexId(BlobStoreRepository.java:930) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.latestIndexBlobId(BlobStoreRepository.java:908) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:746) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.snapshots.SnapshotsService.getRepositoryData(SnapshotsService.java:140) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:97) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:55) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:166) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.2.jar:5.5.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
I was able to fix this with the following:
# Fix for security permission error
cat /home/elasticsearch/.java.policy
grant {
    permission java.lang.reflect.ReflectPermission "suppressAccessChecks";
};
Interestingly, the repository-hdfs plugin already ships a Java security policy file that requests the suppressAccessChecks permission: https://github.com/elastic/elasticsearch/blob/master/plugins/repository-hdfs/src/main/plugin-metadata/plugin-security.policy#L26 From the stack trace, however, it looks like the failing call doesn't go through any of the repository-hdfs classes.
As you point out, the problem here is that the call is coming from core where we do not and will not grant suppressAccessChecks.
@jbaiera Would you take a look please?
@jasontedor yeah, I agree that adding suppressAccessChecks to the whole JVM is not preferred, but I didn't have a good way to work around it in the short term. We are currently moving from Elasticsearch 2.x to 5.x and this was a pain point on Friday. Since we are in a development environment, I can provide/test other things if necessary.
If I were to guess, the bulk of this issue comes from trying to get server defaults for the encryption support introduced in Hadoop:
at org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:640) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.shouldEncryptData(DFSClient.java:1755) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.newDataEncryptionKey(DFSClient.java:1761) ~[?:?]
I didn't see a quick way to short-circuit this process by adding a Hadoop conf setting that says whether encryption is enabled or not. It looks like this code has to go to the NameNode.
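For context, here is a rough paraphrase of the relevant DFSClient logic from the linked 2.8.x source (simplified and reconstructed from the stack trace, not the verbatim code): the client can't decide locally whether data transfer encryption is required; it has to ask the NameNode for its server defaults first, and only caches the answer afterwards.

```java
// Simplified paraphrase of org.apache.hadoop.hdfs.DFSClient (Hadoop 2.8.x),
// reconstructed from the stack trace above and the linked source; not verbatim.
public FsServerDefaults getServerDefaults() throws IOException {
    long now = Time.monotonicNow();
    if (serverDefaults == null
            || now - serverDefaultsLastUpdate > SERVER_DEFAULTS_VALIDITY_PERIOD) {
        serverDefaults = namenode.getServerDefaults(); // RPC through the retrying proxy
        serverDefaultsLastUpdate = now;
    }
    return serverDefaults;
}

private boolean shouldEncryptData() throws IOException {
    FsServerDefaults d = getServerDefaults(); // must reach the NameNode at least once
    return d != null && d.getEncryptDataTransfer();
}
```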
I was going to try webhdfs:// instead of hdfs:// but saw that the repository-hdfs plugin explicitly checks for the scheme to be hdfs://.
Some of these issues might go away if the plugin were to allow webhdfs:// as an option since that is over HTTP instead of direct RPC calls. I saw #24455 is something related to that idea.
Indeed the problem seems to actually be here: https://github.com/apache/hadoop/blob/branch-2.8.1/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L640
serverDefaults = namenode.getServerDefaults();
So it's trying to do reflection on the namenode proxy, which is something we don't allow? I say this based on:
at com.sun.proxy.$Proxy34.getServerDefaults(Unknown Source) ~[?:?]
Wonder if we could use the dfs.trustedchannel.resolver.class setting to force this to be bypassed: https://github.com/apache/hadoop/blob/branch-2.8.1/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferClient.java#L206 ?
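If that route were explored, the resolver itself might look roughly like the sketch below. This is only a hypothetical illustration: the class name, package, and method signatures are assumptions taken from a reading of the Hadoop 2.8.x sources and should be verified; wiring it in would presumably be done via dfs.trustedchannel.resolver.class in the client configuration.

```java
import java.net.InetAddress;
import org.apache.hadoop.hdfs.protocol.datatransfer.TrustedChannelResolver; // assumed package

// Hypothetical, untested sketch: a resolver that treats every channel as trusted,
// so SaslDataTransferClient.checkTrustAndSend would skip fetching a data
// encryption key (and therefore the getServerDefaults call shown above).
public class AlwaysTrustedChannelResolver extends TrustedChannelResolver {

    @Override
    public boolean isTrusted() {
        return true;
    }

    @Override
    public boolean isTrusted(InetAddress peerAddress) {
        return true;
    }
}
```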
As you point out, the problem here is that the call is coming from core where we do not and will not grant suppressAccessChecks.
This is correct. Since a core utility (org.elasticsearch.common.io.Streams.copy) is in the stack trace, it trips up the security manager. What should really happen is that the stream gets wrapped in something that performs the HDFS calls with privileges. I took a quick look and found that this already seems to be done in master with https://github.com/elastic/elasticsearch/pull/22793. Perhaps this just needs to be backported to the 5.x branches.
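For illustration, a minimal sketch of the kind of wrapper described here (hypothetical names, not the actual change from #22793): every read() runs inside doPrivileged, so the lazy DFSClient work triggered by the first read is checked against the plugin's privileged frame rather than against the unprivileged core caller in Streams.copy.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.AccessController;
import java.security.PrivilegedActionException;
import java.security.PrivilegedExceptionAction;

// Hypothetical sketch of a privileged stream wrapper; class and method names
// are illustrative, not the actual code in #22793.
class PrivilegedHdfsInputStream extends FilterInputStream {

    PrivilegedHdfsInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        return doPrivileged(() -> super.read());
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return doPrivileged(() -> super.read(b, off, len));
    }

    private static <T> T doPrivileged(PrivilegedExceptionAction<T> action) throws IOException {
        try {
            return AccessController.doPrivileged(action);
        } catch (PrivilegedActionException e) {
            throw (IOException) e.getCause(); // the action only throws IOException
        }
    }
}
```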
It looks like #22793 is the only non test code change to repository-hdfs that isn't in the 5.x line.
There's a reason that #22793 is not in 5.6. That change was predicated on removing socket permissions from core, which was only done in 6.0+. Thus, we cannot simply backport that change: even if it picked cleanly (which it doesn't), it was engineered with different assumptions than are present in 5.6. It's plausible that some of the code in #22793 will help address the issue here, but instead we have to:
So I found the case where this can happen. I'm trying to boil it down to the simplest steps. From what I've gathered so far:
For reference my test case is here:
I haven't had a chance to see if this is fixed in 6.x or master yet. I've been focusing on trying to recreate it in a standalone test case with 5.5.2.
Thank you so much @risdenk, that's awesome effort and much appreciated. @jbaiera will work closely with you on this starting next week. And 5.5.2 is definitely the right place to focus efforts (or the 5.6 branch is fine too).
The code path differs for a readonly repository in this method:
Non-Read-Only Path
The BlobStoreRepository.getRepositoryData() method calls BlobStoreRepository.latestIndexBlobId(), which in the non-read-only case eventually calls DFSClient.getServerDefaults() with the correct file context through HdfsBlobContainer. This filters down to the HDFS blob store implementation with doPrivileged. The DFSClient.getServerDefaults() result is then cached for 60 minutes, so the next call (the one from Streams.copy that runs without the file context) works.
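The privileged hop visible in the second stack trace (HdfsBlobStore.lambda$execute$0 -> doPrivileged -> HdfsBlobStore.execute) looks roughly like the sketch below; it is an approximation for illustration, not the actual 5.5.2 source. This is what primes the DFSClient server-defaults cache under privileges on the non-read-only path.

```java
import java.io.IOException;
import java.security.AccessController;
import java.security.PrivilegedActionException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.fs.FileContext;

// Approximate sketch of the HdfsBlobStore.execute(...) pattern from the stack
// trace; names are simplified and the wiring is hypothetical, not the 5.5.2 code.
class HdfsBlobStoreSketch {

    interface Operation<V> {
        V run(FileContext fileContext) throws IOException;
    }

    private final FileContext fileContext;

    HdfsBlobStoreSketch(FileContext fileContext) {
        this.fileContext = fileContext;
    }

    <V> V execute(Operation<V> operation) throws IOException {
        try {
            // HDFS calls routed through here are checked against this privileged
            // frame, so the plugin's own policy grants apply regardless of callers.
            return AccessController.doPrivileged(
                    (PrivilegedExceptionAction<V>) () -> operation.run(fileContext));
        } catch (PrivilegedActionException e) {
            throw (IOException) e.getCause(); // the operation only throws IOException
        }
    }
}
```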
Read-Only Path (Failure)
The BlobStoreRepository.getRepositoryData() method calls BlobStoreRepository.latestIndexBlobId(), which in the read-only case doesn't appear to call out to HDFS at all. That means there are no cached server defaults in DFSClient.getServerDefaults(), so the next call (the one from Streams.copy that runs without the file context) fails.
DFSClient and suppressAccessChecks
Side note: interestingly, Travis CI doesn't seem to fail, whereas locally on my Mac it does. All the testing is in Docker, so I'm not sure what the difference is right now.
For reference, I switched the base Elasticsearch image to 6.0.0-beta2 and could not reproduce the same error. The HDFS repository lists correctly even with readonly.
@risdenk Thanks for all the work toward distilling the issue down to a few simple steps. This is incredibly helpful information. I'll start working toward getting a failing test built and then we'll know where to go from there.
@jbaiera - Glad the information is helpful.
I finally got gradle integTestSecure working, and even after editing the secure_hdfs_repository REST test YAML files I wasn't able to reproduce the problem. I can see logically how the code path I'm hitting in my environment gets hit, but I'm having a hard time producing the issue in the test suite.
I ran into a few problems with Vagrant 2.0, and on a Mac the Vagrant test fixture just stops. On a Linux box the test suite works OK.
@risdenk I wanted to check in and share what I've got on this: I also was running into issues with reproducing this via the rest integration test tooling. I decided to try and reproduce by standing up the local hdfs fixture with a local elasticsearch instance and re-tracing your provided steps. This did indeed reproduce the error you've described.
This raises the question of why it doesn't cleanly reproduce in the REST integration test framework. Diving deeper uncovered what look like two unrelated bugs (one in the repository, and one potentially in the HDFS client), but it hasn't yielded any answers as to why it doesn't reproduce. I'll update back here with more info as it's available.
@jbaiera Glad I'm not crazy: it doesn't reproduce in the integration test, but it does if you stand up a cluster. Last night I was able to get it to reproduce once in the REST integration test framework.
The branch for this is here: https://github.com/risdenk/elasticsearch/tree/test_readonly_repository
I created this from the 5.5 branch. https://github.com/elastic/elasticsearch/compare/5.5...risdenk:test_readonly_repository
From that branch I was able to recreate the access denied with gradle integTestSecure from the plugins/repository-hdfs directory.
Checking in again on this since it's been the parent of a few real brain teasers:
I was able to get the REST test framework to reproduce the error, but only when the test case that reproduces it is run relatively early in the suite. This is a problem since our testing suite has no real sense of ordering (the test spec names with the leading numbers are a bit misleading). I found that one of the reasons the integration test passes when the test case runs later is that the suppressAccessChecks permission check is skipped: the methods in question have already been set accessible by privileged code in previous test cases.
That being said, since there is at least a test that reproduces it now, I'll just get a PR up for fixing it soon. We'll have to just accept that the ordering of the tests matters for reproducing it since the state of the HDFS Client doesn't give us much choice.
This is a problem since our testing suite has no real sense of ordering (The test spec names with the leading numbers are a bit misleading).
Ah that explains a lot! I couldn't figure out why the integration tests passed when I explicitly set the order with the leading numbers.
Do the tests run in parallel as well? If each test setup and tore down the repository there shouldn't be any overlap since the HDFS client is per repository.
The overlap in this scenario comes from all the tests executing on the same JVM and class instances. By the time the reproducing test case runs, previous test cases in the suite have already set the methods in question to be accessible from privileged call sites. When the reproducing test case attempts to run, the methods that it would normally try to set accessible are already accessible. The code skips trying to set them, and thus avoids the security exceptions.
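A tiny self-contained illustration of that masking effect (hypothetical code, not the actual Hadoop or test sources): a caller that only flips the accessible flag when it isn't already set never reaches the SecurityManager check if earlier, privileged code already set it.

```java
import java.lang.reflect.Method;

// Hypothetical illustration of the masking effect: code that only calls
// setAccessible when the flag is not already set never triggers the
// SecurityManager check if earlier (privileged) code has already flipped
// the flag on the same Method instance.
public class AccessibleMaskingDemo {

    static Object invokeLikeTheRetryHandler(Method method, Object target, Object... args) throws Exception {
        if (!method.isAccessible()) {
            method.setAccessible(true); // the only place a ReflectPermission check would run
        }
        return method.invoke(target, args);
    }

    public static void main(String[] args) throws Exception {
        Method hashCode = Object.class.getDeclaredMethod("hashCode");
        hashCode.setAccessible(true); // stands in for an earlier, privileged test case
        // Later "unprivileged" use of the same Method instance: the accessible flag is
        // already set, so setAccessible is skipped and no SecurityException can surface.
        System.out.println(invokeLikeTheRetryHandler(hashCode, new Object()));
    }
}
```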
That said, there are still some other discrepancies between running the rest test suite and running against a local cluster that I haven't pinned down yet, but my hunch so far has been that the differences are more likely caused by how rapid the calls are made than the substance of the calls themselves. The HDFS client is pretty sensitive when a security manager is installed.
After getting the requisite vagrant testing support commits backported to the 5.x branches and shoring up the last of the changes needed, I've opened #26714 for this.
Thanks @jbaiera!
I have merged #26714 and backported to the 5.5.x branch. This should make it into a release soon.