Beats: [Elastic-agent] Sometimes backend processes are not started

Created on 14 Sep 2020  路  12Comments  路  Source: elastic/beats

I.e. https://beats-ci.elastic.co/blue/organizations/jenkins/e2e-tests%2Fe2e-testing-mbp/detail/PR-289/1/pipeline/350

It causes some flakiness in the test suite

Log error

[2020-09-14T18:59:45.170Z] time="2020-09-14T18:59:44Z" level=warning msg="filebeat process is not running in the container yet" container=ingest-manager_elastic-agent_1 desiredState=started elapsedTime=126.374302ms error="filebeat process is not running in the container yet" process=filebeat retry=1

Agent log

[2020-09-14T18:59:44.908Z] elastic-agent_1 | 2020-09-14T18:59:44.632Z ERROR application/emitter.go:88 Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: failed to start '/usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat: text file busy: failed to start '/usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat: text file busy

cc/ @ph @EricDavisX

Ingest Management bug failed-test v7.10.0

All 12 comments

This flakiness affects both stand-alone and fleet mode tests

@blakerouse could you take a look, this looks like a concurrency issue. I am not sure it due because of composable input. Maybe a concurrency issue in the structure?

I can confirm that since past week, this is happening much more frequently (I'd say almost always), when in the past it happened from time to time, possibly being resolved restarting the CI job.

@mdelapenya / @EricDavisX I've moved the issue into the beats repository.

Pinging @elastic/ingest-management (Team:Ingest Management)

I have also found this issue while testing this change https://github.com/elastic/beats/pull/21213

It looks like the operation-install happens twice, but second one is skipped and then agent tries to start the service, while the first installation is happening:

2020-09-22T11:05:31.672Z    DEBUG   operation/operation_retryable.go:83 running operation 'operation-verify' of the block 'retryable block: operation-fetch operation-verify'
2020-09-22T11:05:31.683Z    DEBUG   operation/operator.go:230   running operation 'operation-install' for metricbeat.8.0.0
2020-09-22T11:05:31.814Z    INFO    operation/operator.go:226   operation 'operation-install' skipped for metricbeat.8.0.0
2020-09-22T11:05:31.814Z    DEBUG   operation/operator.go:230   running operation 'operation-start' for metricbeat.8.0.0
2020-09-22T11:05:31.973Z    INFO    log/reporter.go:40  2020-09-22T11:05:31Z: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat--8.0.0[c5c35fa6-046d-429b-9f3a-ca680467ba80]: State changed to STARTING: Starting
2020-09-22T11:05:31.973Z    ERROR   application/emitter.go:88   Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: failed to start '/usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat: text file busy: failed to start '/usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat: text file busy

The check for the install operation seems to check only if the directory exists: https://github.com/elastic/beats/blob/78dc0cfea8a13e9303992f4350368a5d6f7e3802/x-pack/elastic-agent/pkg/artifact/install/dir/dir_checker.go#L21-L24

So it seems that second installation request thinks that the service is correctly installed because the first one has already created the directory.

If we need concurrent installations of the same service maybe an option is to have an exclusive lock file in the installation directory.

Update: with latest updates we see this error:

[2020-09-24T07:06:01.287Z] elastic-agent_1 | 2020-09-24T07:03:38.989Z ERROR operation/operation_retryable.go:85 operation operation-verify failed
[2020-09-24T07:06:01.287Z] elastic-agent_1 | 2020-09-24T07:03:38.989Z DEBUG application/periodic.go:40 Failed to read configuration, error: could not emit configuration: operator: failed to execute step sc-run, error: operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from '/usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * check detached signature: openpgp: invalid signature: hash tag doesn't match
[2020-09-24T07:06:01.287Z] elastic-agent_1 |
[2020-09-24T07:06:01.287Z] elastic-agent_1 | : operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from '/usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * check detached signature: openpgp: invalid signature: hash tag doesn't match
It seems that the .ASC is not present in none of the expected locations: filesystem or artifactory. So no process is able to start

@mdelapenya Can you file your last comment in a seperate issue, that is unrelated to the initial issue. Just want to be able to track it correctly and not get them confused.

Sure, I can do it. I thought it was related because the observed results are the same: backend proceses are not started.

I'll do as you suggest and convert that comment in a new issue. Thanks!

@michalpristas @blakerouse can this issue be closed?

Yes this can be closed.

@mdelapenya quick check to confirm that this is not being trapped by the e2e-testing framework at this point.

Was this page helpful?
0 / 5 - 0 ratings