It causes some flakiness in the test suite
[2020-09-14T18:59:45.170Z] time="2020-09-14T18:59:44Z" level=warning msg="filebeat process is not running in the container yet" container=ingest-manager_elastic-agent_1 desiredState=started elapsedTime=126.374302ms error="filebeat process is not running in the container yet" process=filebeat retry=1
[2020-09-14T18:59:44.908Z] elastic-agent_1 | 2020-09-14T18:59:44.632Z ERROR application/emitter.go:88 Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: failed to start '/usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat: text file busy: failed to start '/usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat: text file busy
cc/ @ph @EricDavisX
This flakiness affects both stand-alone and fleet mode tests.
@blakerouse could you take a look? This looks like a concurrency issue. I am not sure whether it is caused by the composable inputs; maybe it is a concurrency issue in the structure?
I can confirm that since last week this has been happening much more frequently (I'd say almost always), whereas in the past it happened only from time to time and could usually be resolved by restarting the CI job.
@mdelapenya / @EricDavisX I've moved the issue into the beats repository.
Pinging @elastic/ingest-management (Team:Ingest Management)
I have also found this issue while testing this change https://github.com/elastic/beats/pull/21213
It looks like operation-install happens twice, but the second one is skipped, and then the agent tries to start the service while the first installation is still in progress:
2020-09-22T11:05:31.672Z DEBUG operation/operation_retryable.go:83 running operation 'operation-verify' of the block 'retryable block: operation-fetch operation-verify'
2020-09-22T11:05:31.683Z DEBUG operation/operator.go:230 running operation 'operation-install' for metricbeat.8.0.0
2020-09-22T11:05:31.814Z INFO operation/operator.go:226 operation 'operation-install' skipped for metricbeat.8.0.0
2020-09-22T11:05:31.814Z DEBUG operation/operator.go:230 running operation 'operation-start' for metricbeat.8.0.0
2020-09-22T11:05:31.973Z INFO log/reporter.go:40 2020-09-22T11:05:31Z: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat--8.0.0[c5c35fa6-046d-429b-9f3a-ca680467ba80]: State changed to STARTING: Starting
2020-09-22T11:05:31.973Z ERROR application/emitter.go:88 Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: failed to start '/usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat: text file busy: failed to start '/usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat: text file busy
The check for the install operation only verifies whether the installation directory exists: https://github.com/elastic/beats/blob/78dc0cfea8a13e9303992f4350368a5d6f7e3802/x-pack/elastic-agent/pkg/artifact/install/dir/dir_checker.go#L21-L24
So it seems that the second installation request thinks the service is correctly installed because the first one has already created the directory.
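To make the race concrete, here is a minimal, self-contained sketch (not the agent's code; the `/tmp/fake-metricbeat` path is made up) that reproduces the "text file busy" (ETXTBSY) error on Linux: exec'ing a binary that some file descriptor still holds open for writing fails, which matches what the logs show when operation-start runs while the first operation-install is still writing the binary.

```go
// Minimal sketch reproducing the ETXTBSY race; not the agent's code.
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

func main() {
	// Simulate the first operation-install: copy a binary but keep it open for writing.
	dst, err := os.OpenFile("/tmp/fake-metricbeat", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		panic(err)
	}
	src, err := os.Open("/bin/true")
	if err != nil {
		panic(err)
	}
	io.Copy(dst, src) // contents fully written...
	src.Close()
	// ...but dst is deliberately left open, as if extraction were still in progress.

	// Simulate operation-start racing with the install: exec the binary now.
	if err := exec.Command("/tmp/fake-metricbeat").Run(); err != nil {
		// On Linux this prints: fork/exec /tmp/fake-metricbeat: text file busy
		fmt.Println("start failed:", err)
	}

	dst.Close()
	os.Remove("/tmp/fake-metricbeat")
}
```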
If we need to support concurrent installation requests for the same service, one option could be an exclusive lock file in the installation directory.
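A rough sketch of that idea (the names `acquireInstallLock`, `isInstalled`, and the `.install.lock` filename are hypothetical, not an existing agent API) could look like this:

```go
// Hypothetical sketch of the lock-file approach; not the agent's actual API.
package install

import (
	"errors"
	"os"
	"path/filepath"
)

var ErrInstallInProgress = errors.New("installation already in progress")

// acquireInstallLock atomically creates <installDir>/.install.lock.
// O_EXCL guarantees that only one concurrent installer can succeed;
// the others get ErrInstallInProgress instead of assuming the install is done.
func acquireInstallLock(installDir string) (release func(), err error) {
	lockPath := filepath.Join(installDir, ".install.lock")
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if errors.Is(err, os.ErrExist) {
		return nil, ErrInstallInProgress
	}
	if err != nil {
		return nil, err
	}
	f.Close()
	return func() { os.Remove(lockPath) }, nil
}

// isInstalled would replace the plain directory-existence check: the install
// only counts as finished once the directory exists AND the lock file has
// been removed by the installer that created it.
func isInstalled(installDir string) bool {
	if _, err := os.Stat(installDir); err != nil {
		return false
	}
	_, err := os.Stat(filepath.Join(installDir, ".install.lock"))
	return os.IsNotExist(err)
}
```

One caveat with lock files is that they survive a crashed installer, so some staleness handling (for example checking the owning PID) would probably be needed as well.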
Update: with the latest changes we see this error:
[2020-09-24T07:06:01.287Z] elastic-agent_1 | 2020-09-24T07:03:38.989Z ERROR operation/operation_retryable.go:85 operation operation-verify failed
[2020-09-24T07:06:01.287Z] elastic-agent_1 | 2020-09-24T07:03:38.989Z DEBUG application/periodic.go:40 Failed to read configuration, error: could not emit configuration: operator: failed to execute step sc-run, error: operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from '/usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * check detached signature: openpgp: invalid signature: hash tag doesn't match
[2020-09-24T07:06:01.287Z] elastic-agent_1 |
[2020-09-24T07:06:01.287Z] elastic-agent_1 | : operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from '/usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * check detached signature: openpgp: invalid signature: hash tag doesn't match
It seems that the .asc file is not present in any of the expected locations (filesystem or Artifactory), so no process is able to start.
@mdelapenya Can you file your last comment as a separate issue? It is unrelated to the initial issue, and I just want to be able to track it correctly and not get them confused.
Sure, I can do that. I thought it was related because the observed results are the same: backend processes are not started.
I'll do as you suggest and convert that comment into a new issue. Thanks!
@michalpristas @blakerouse can this issue be closed?
Yes this can be closed.
@mdelapenya quick check to confirm that this is no longer being hit by the e2e-testing framework at this point.