Elasticsearch: CI failures due to indices not being cleaned up as expected

Created on 28 Aug 2019 · 48 comments · Source: elastic/elasticsearch

There appears to be a problem with the cleanup code that is supposed to remove all indices between tests. It's difficult to be sure, but there seems to be a trend of a wide variety of tests failing with a ResourceAlreadyExistsException when creating an index, with no visible cause in the test itself.

I noticed this in this master intake build, where IndexLifecycleIT.testStartStopILM failed; that test creates an index as one of its first operations, at a point when the cluster should be blank.

Looking at build stats for failures with messages containing resource_already_exists_exception, we can see that things were very quiet prior to July 23 and failures have grown steadily since, suggesting this may be related to recent build changes:
[Screenshot: build-stats graph of failures containing resource_already_exists_exception, 28 Aug 2019]
[Note: This graph excludes a 10 minute period on Aug. 15 which had 608 failures, to make the scale clearer]

There is no clear relationship between the tests that fail because of unexpectedly pre-existing indices and any specific functional area. It appears to happen most often in client and YAML tests, although not exclusively. In a few spot checks I've seen:

  • IndexLifecycleIT.testStartStopILM
  • CCSDuelIT.testPagination
  • test {yaml=search/240_date_nanos/date_nanos requires dates after 1970 and before 2262}
  • IndicesClientIT.testCloseExistingIndex
  • DatafeedJobsRestIT.testLookbackWithGeo

All of these appear to have failed while creating indices when the cluster should be a blank slate.

I believe this is also the cause of https://github.com/elastic/elasticsearch/issues/45605 and https://github.com/elastic/elasticsearch/issues/45805

Labels: :Delivery/Build, >test, >test-failure, Delivery

Most helpful comment

On the other hand, there's no evidence of this happening on non-master branches, where we don't run with --parallel.

All 48 comments

Pinging @elastic/es-core-infra

Also #45600

All of these appear to have failed while creating indices when the cluster should be a blank slate.

Have a look at the server side logs. Do they show something like:

[2019-08-14T18:07:42,945][WARN ][o.e.i.IndicesService     ] [integTest-0] [test/ee-7zmiUQGayvOq6LgEShg] failed to delete index
org.elasticsearch.env.ShardLockObtainFailedException: [test][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]

If so then it's not just the test client cleanup code that's to blame, but the same server code that customers will be running in production, and this is looking like a big worry for 7.4. See https://github.com/elastic/elasticsearch/issues/45600#issuecomment-521620912

@DaveCTurner I believe you're working on this now as well. So ping :)

Yep I'm seeing if I can reproduce this, specifically the test failure https://github.com/elastic/elasticsearch/issues/45956. I did see that test fail on my machine reporting a similar ShardLockObtainFailedException after 1400 iterations while concurrently running stress -c 16 -i 4 -m 8 -d 4, but unfortunately I was running on a commit from before my holidays (302d29c87059658f776fb9c22bc76362bd46d0a0). I've added more logging and updated to a more recent master and am trying again.

The testStartStopILM failure did not have any indication that the deletion failed. There was an exception roughly aligned with the failure, but it was caused by an IndexNotFoundException on the double-created index, occurring after that index was logged as deleted:

[2019-08-28T17:05:49,025][WARN ][o.e.i.s.RetentionLeaseBackgroundSyncAction] [integTest-0] unexpected error during the primary phase for action [indices:admin/seq_no/retention_lease_background_sync], request [RetentionLeaseBackgroundSyncAction.Request{retentionLeases=RetentionLeases{primaryTerm=1, version=1, leases={peer_recovery/sIFqVv9GQ86guDpZYRI0pg=RetentionLease{id='peer_recovery/sIFqVv9GQ86guDpZYRI0pg', retainingSequenceNumber=0, timestamp=1567011946292, source='peer recovery'}}}, shardId=[baz][0], timeout=1m, index='baz', waitForActiveShards=0}]
org.elasticsearch.index.IndexNotFoundException: no such index [baz]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:190) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:116) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteSingleIndex(IndexNameExpressionResolver.java:278) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.concreteIndex(TransportReplicationAction.java:234) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:651) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onNewClusterState(TransportReplicationAction.java:795) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:311) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:169) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:120) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:112) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:792) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:771) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1091) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1200) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1174) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:56) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onFailure(TransportReplicationAction.java:408) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.handleException(TransportReplicationAction.java:402) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$runWithPrimaryShardReference$3(TransportReplicationAction.java:370) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:64) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:253) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.seqno.RetentionLeaseBackgroundSyncAction.shardOperationOnPrimary(RetentionLeaseBackgroundSyncAction.java:97) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.seqno.RetentionLeaseBackgroundSyncAction.shardOperationOnPrimary(RetentionLeaseBackgroundSyncAction.java:54) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:916) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:108) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:393) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:315) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2753) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:112) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:285) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:237) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2727) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:857) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:311) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:274) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:228) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:196) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:277) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:376) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:184) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter.lambda$inbound$1(ServerTransportFilter.java:112) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:246) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:306) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:317) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:244) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:139) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter.inbound(ServerTransportFilter.java:103) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:284) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:724) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

Duplicate of #45605 ?

Another build failure, seems to be related: https://gradle-enterprise.elastic.co/s/wbwkzcamo7a4e

@original-brownbear looks like this is still causing builds to fail. Any thoughts?

@atorok the team is looking into this this morning.

We have an explanation for why index creation might result in a resource_already_exists_exception response. By default the HttpAsyncClient in use will retry if it hasn't received a response within 30 seconds. We have observed situations where the CI machine grinds to a halt for a while, causing the creation of an index to take over 30 seconds (recalling that it doesn't just create the index, but also waits for the primaries to start). This triggers a retry from the client, but the second attempt discovers that the first attempt already created the index and returns the resource_already_exists_exception that we're seeing.

We can reproduce this by waiting after the index creation completed for the client to time out:

diff --git a/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java b/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java
index 9bfbec9..1dc3163 100644
--- a/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java
@@ -215,6 +215,11 @@ public class MetaDataCreateIndexService {
                         } else {
                             logger.trace("[{}] index created and shards acknowledged", request.index());
                         }
+                        try {
+                            Thread.sleep(30000);
+                        } catch (InterruptedException e) {
+                            throw new AssertionError(e);
+                        }
                         listener.onResponse(new CreateIndexClusterStateUpdateResponse(response.isAcknowledged(), shardsAcknowledged));
                     }, listener::onFailure);
             } else {

We can extend the client's timeout to work around this, and I think it would be good to stop it from retrying at all, to save us from going through the same deeply confusing investigation next time round. However, the fundamental issue is why it takes over 30 seconds to create an index in CI.
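As a rough illustration of the timeout workaround, here is a minimal sketch of raising the low-level REST client's socket timeout above the 30-second default, so that a slow create-index call no longer times out client-side and gets retried. The class name, host, port, and the 60-second value are made up for illustration; the connect/socket defaults noted in the comments are the documented defaults of the 7.x low-level client.

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class LongerTimeoutClientFactory { // hypothetical helper, not part of the test framework

    /**
     * Builds a low-level REST client with a raised socket timeout, so a
     * create-index call that waits on slow IO does not time out on the client
     * side and get retried (the retry is what produces the spurious
     * resource_already_exists_exception).
     */
    public static RestClient build(HttpHost... hosts) {
        return RestClient.builder(hosts)
            .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                .setConnectTimeout(5_000)     // client default is 1s
                .setSocketTimeout(60_000))    // client default is 30s; doubled here
            .build();
    }

    public static void main(String[] args) throws Exception {
        // illustrative usage against a local test cluster
        try (RestClient client = build(new HttpHost("localhost", 9200, "http"))) {
            // issue requests with client.performRequest(...) as usual
        }
    }
}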

The one that we investigated in detail took ~6 sec to write the metadata to disk when the index was first created (confirmed by TRACE logging of MetaStateService), then ~4 sec to apply the resulting cluster state, and then ~22 sec to actually start the primary. The slow metadata writing suggests there's an issue with IO on the CI worker.

Opened https://github.com/elastic/elasticsearch/pull/46455 to double the timeout for now. Disabling retries is a trickier change as far as I can see, because we have some tests with node restarts and the like, and I don't know whether pinning requests to single nodes to stop retrying will work out of the box.

The slow metadata writing suggests there's an issue with IO on the CI worker.

This is just IO contention from running many ES nodes and tests in parallel. We limit this _somewhat_, but we are hamstrung a bit by some limitations in the build. This should improve in Gradle 6.0, as we will be able to define explicit permits for starting test nodes. Right now Gradle treats a test task spinning up 3 nodes the same as a task spinning up a single node, and this can cause problems.

@mark-vieira, in all the cases I looked into deeply, and also in #46144, it looked like there were concurrent Gradle tasks building distributions or the like, for instance these (from #46144):

18:33:18 > Task :distribution:archives:buildNoJdkDarwinTar
18:33:29 > Task :test:fixtures:krb5kdc-fixture:composeDown
18:33:36 > Task :distribution:packages:buildDeb
18:33:38 > Task :distribution:archives:buildNoJdkLinuxTar

The cause could be either parallel test activity or the fact that we do these heavy-throughput IO operations during testing. I think most of our REST-style tests are not really that heavy on IO, and I would think the likelihood of them disturbing each other enough through IO is small. This is a hunch, not a verified statement. But I think trying to avoid building distributions while testing is worth trying out, to see if this fixes the issue.

I'm a bit concerned about writing this off as IO contention only, and worry that CI also has some other issue affecting it or making it worse, mainly because:

  1. It happened in the past
  2. We have been running with --parallel for quite some time now; I would have expected this to surface sooner
  3. I'm not 100% sure but I think we saw this with @original-brownbear in a bwc test before those ran with --parallel

I'm not 100% sure but I think we saw this with @original-brownbear in a bwc test before those ran with --parallel

Yes, I think we did see this issue with non-parallel builds before. But that's not necessarily unexpected IMO (just less likely).

I think for the cases we've seen so far (with better logging activated), IO contention is the cause. If there are other issues, those will surface in time, I guess, and we'll have an easier time tracking them down once we have the obvious IO issue out of the way? :)

I'm not arguing that IO contention is not real, and I definitely don't want to spoil the party; this is a great find. It would be great if that were the only issue, I just think we shouldn't let our guard down yet.

Another data point: before enabling parallel I ran lots of tests, some of them on larger instances with much higher levels of parallelism (on the same disk size), and never ran into this issue, nor any other failures for that matter; builds succeeded. It was just that the improvement was too small to justify the cost.

I'm running some tests locally with your change reverted to try to reproduce this, both on standard instances and some with local SSDs.

On the other hand, there's no evidence of this happening on non-master branches, where we don't run with --parallel.

Here are some screenshots from the GCP console for checkPart1 (the failure did not reproduce).

UPDATE: This is on a 300GB disk, whereas CI uses 150GB disks.

[Screenshots: GCP console graphs for CentOS, Ubuntu 14, and Ubuntu 18]

The same running on a local SSD

UPDATE: This is on a 300GB disk, whereas CI uses 150GB disks.

[Screenshots: GCP console graphs for CentOS, Ubuntu 14, and Ubuntu 18 on a local SSD]

On a side note, running on a local SSD saves ~5 minutes of build time (~16%).

I'm also running some tests on a ramdisk.
With how CI is structured currently, we could probably switch to a ramdisk as well, and it might be easier to set up too.

Builds on a ramdisk save ~7 minutes (~23%) and have the advantage of completely eliminating disk IO from the equation.

That sounds great and a good solution for PR and intake builds.

I wonder if it would make sense to still keep some jobs running against SSDs, to ensure that we also run our integration tests against storage with less predictable and slower timings. For instance, testAckedIndexing might behave less chaotically on a RAM disk than on an SSD, and that could be a problem for relying on the results of such tests.

I agree it would be dangerous for _all_ of Elasticsearch CI to run against RAM disks. We could end up shipping software that regularly fails on the sort of storage that everyone will be using in production.

We have an OS compatibility matrix. I think that would be the right place to keep running against real disks and with minimal optimization (possibly at lower parallelism levels).
Most of the other tests could continue to use it, but I'm working to enable it only for runs that use --parallel, so right now that will be intake and PR against master only.

I've been running checkPart1 in a loop for the past couple of hours now and it doesn't seem like IO is being saturated. There are short peaks, but nothing that would suggest it was saturated for multiple seconds.
This seems to happen much more on Ubuntu than on CentOS (out of 50 failures only one was on CentOS), so it's not out of the question that something (e.g. checking for updates) is started while the build runs. It would have to be fairly IO-intensive to push it over the edge.

@original-brownbear do you think we should revert the increased timeout once we start running on a ramdisk? At least temporarily, considering your comment from the PR. If disk IO was not the whole story, this could make that "something else" harder to detect.

@mark-vieira I've run without the Gradle cache enabled in my tests; would you imagine it adds significant IO overhead?

do you think we should revert the increased timeout once we start running on a ramdisk? At least temporarily, considering your comment from the PR. If disk IO was not the whole story, this could make that "something else" harder to detect.

Fine by me to go back to 30s (we could even go lower, to something like 20s, since we have a bunch of spots where the internal timeouts for master operations are 30s, and 30s is literally the hardest value to interpret IMO).

I would prefer to disable retries when reverting to 30s; that would make for a much cleaner failure. I think pushing for a lower timeout is asking for extra trouble at a time when we don't need it (now); we can pursue that at a different time (if at all), once this is all resolved.

I started running with -Dscan.capture-task-input-files --build-cache and got 2 failures on the first try (this is the same hash that has not failed in the past 4 hours). Strangely, the GCP console stopped updating at the same time, so I'm not sure where we stand on the disk IO.

Also, running on a ramdisk doesn't seem to yield significant savings (~3 minutes or less) in CI (tested in a local CI setup); this is with caching enabled.

Another thing to point out, from the GCP documentation:

However, to ensure consistent performance for more general use of the boot device, use either an SSD persistent disk as your boot disk or use a standard persistent disk that is at least 200 GB in size.

Looking at build-stats, the problem only started appearing on the 8th of August.
The disk size was reduced on the 2nd of August in elastic/infra@44757b9f94, we added -Dscan.capture-task-input-files on the 15th, and we have a big spike in the number of failures on that day.

My working theory is that these two actions combined degraded the disk performance.

@mark-vieira if Gradle writes anywhere besides the project directory for caching, we should add that to the ramdisk too.

The tests I ran weren't quite representative, unfortunately, as the script I use was not updated and was using a 300GB disk, double what CI uses (150GB).

After looking at the docs for scan.capture-task-input-files, it doesn't seem to do anything that would be too IO-intensive, and running a test confirms that, so I'm not sure what caused the spike that day; maybe just degraded disk IO on GCP? The commit seems to be a coincidence.

In summary

  • intake and PR checks now run in ramdisks (elastic/infra#14376)
  • the disk size in CI was increased back to 350GB (elastic/infra#14379)
  • the local gradle cache is moved so it too will be on a ramdisk (#46486)

Here are runs on Ubuntu 18: the first with scan.capture-task-input-files, the second without.
[Screenshots: two Ubuntu 18 runs]

It doesn't seem to be worse than the 300GB disk, possibly because the number of vCPUs also plays a role.

scan.capture-task-input-files is mainly going to affect the size of the build scan payload. It shouldn't incur any additional IO, since Gradle has to snapshot input files regardless of whether that option is enabled.

In terms of IO use for the build cache, it's certainly going to increase. For every build cache miss we are going to pack the output and ship it up to the remote cache. One thing we can do here is disable pushing to the local cache in CI, since we use ephemeral workers. This would help with the overhead associated with cache misses on builds that otherwise don't push (PR builds).

But I think trying to avoid building distributions while testing is worth trying out to see if this fixes the issue.

We could potentially help this situation by building all the distributions early via ordering rules.

Looks like there were no failures since introducing the ram disks :tada:

@mark-vieira does the caching implementation push as the build progresses, or is it something that happens at the end of the build? It would be nice if there were a way to delay it so it doesn't interfere with the build.

@mark-vieira does the caching implementation push as the build progresses, or is it something that happens at the end of the build? It would be nice if there were a way to delay it so it doesn't interfere with the build.

Cache pushing happens at the end of the execution of each task, so it happens as the build progresses. Because it's part of task execution, it's effectively limited by existing parallelism. With the local cache disabled, the additional IO here should be minimal, as we've already built the outputs and Gradle streams the result to the remote server. The only IO overhead is in the form of reads, and this is probably going to be bottlenecked by network IO to the remote cache anyhow. From what I can tell it's mostly write IOPS hurting tests.

In other words, I suspect the addition of the remote build cache amounts to negligible IO overhead compared to the build/tests themselves. I've done experiments in the past that confirm this suspicion as well.

Not sure those are the same failure, @matriv; it's more likely something with that particular test. I suggest we open a different issue for it and mute it.

I'm going ahead and closing this issue to avoid confusion.
After moving to a ramdisk, this has not happened again in over a week.
@original-brownbear we can track the removal of retries from the client in a different ticket if you would like. That's the reason I initially left this one open.
