Ddev: mkcert error on Travis upon creating the web container

Created on 24 Sep 2019 · 39 Comments · Source: drud/ddev

Describe the bug
In the Travis CI environment, we intermittently hit the following error, which we cannot reproduce reliably:

Failed to start schuler-scholar: web container failed: log=, err=container exited, please use 'ddev logs -s web to find out why it failed`

After adding ddev logs to the CI script, this is its output:

+ sudo chown -R 2000:2000 /mnt/ddev-global-cache/ /home/travis/.ssh /home/travis/.drush /home/travis/.gitconfig /home/travis/.my.cnf

+ '[' -d /mnt/ddev_config/homeadditions ']'

+ cp -r /mnt/ddev_config/homeadditions/. /home/travis/

+ mkcert -install

ERROR: failed to save CA key: open /mnt/ddev-global-cache/mkcert/rootCA-key.pem: permission denied

To Reproduce
Steps to reproduce the behavior:
On a Travis-based environment, start a DDEV instance alongside Solr, skipping the mkcert configuration on the Travis side.

Expected behavior
ddev web container should start.

Version and configuration information (please complete the following information):

  • Host computer OS and Version:
Build language: php
Build group: stable
Build dist: trusty
Build id: 128908179
Job id: 238529129
Runtime kernel version: 4.4.0-101-generic
travis-build version: 24382795bc479c6eb9a0651f15dd00c2dca32750
Build image provisioning date and time
Tue Dec  5 19:58:13 UTC 2017
Operating System Details
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:   trusty
  • Docker version information (use docker version) and insert it here.
docker-ce_18.06.3~ce~3-0~ubuntu_amd64.deb
  • ddev version information (use ddev version)
v1.11.0
  • config.yaml contents for the misbehaving project
APIVersion: v1.11.0
name: foobar-foobar
type: drupal8
docroot: web
php_version: "7.3"
webserver_type: nginx-fpm
router_http_port: "8081"
router_https_port: "8043"
xdebug_enabled: false
additional_hostnames: []
additional_fqdns: []
mariadb_version: "10.2"
nfs_mount_enabled: false
provider: default
hooks:
  post-import-db:
  - exec: drush config-set system.site name Foobar
  - exec: drush en -y environment_indicator
  - exec: drush cr
  - exec: drush uli
  post-start:
  - exec: drush site-install foobar -y --db-url=mysql://db:db@db/db --account-pass=admin
      --existing-config
  - exec: drush en foobar_migrate
  - exec: drush sapi-c
  - exec: drush migrate:import --group=foobar
  # For Code sniffer
  - exec: composer global require drupal/coder:^8.3.1
  - exec: composer global require dealerdirect/phpcodesniffer-composer-installer
  # For Drupal's private files
  - exec: mkdir /var/www/private
  - exec: drush uli
use_dns_when_possible: true
timezone: ""
  • Do you have any custom configuration (nginx, php, mysql) in the .ddev folder? If so, have you tried without them?
    N/A

All 39 comments

It's likely that Travis is at fault; my idea is that DDEV could degrade gracefully, if that's possible.

Could you change the travis config to Ubuntu 16.04 and see what the result is? 14.04 is no longer supported, as you know, and I'm not sure what that means about docker support on there. If docker isn't properly handling volumes then this could happen. Or if travis docker config denies permissions on volumes somehow?

If this isn't regularly reproducible, you might consider starting a more trivial project before the main one, to make sure the volume gets properly created?

What docker setup and version is travis using?

Doesn't travis now have ssh capabilities so you can go in there and study what's going on?

It might be possible to make the container forgive the existence of a required directory, but that would expand to a thousand problems, since that mount is always assumed to be available.

Thanks for the suggestions!
I upgraded Ubuntu to bionic and Docker to docker-ce 5:19.03.2~3-0~ubuntu-bionic (amd64), and I still get exactly the same error message.

Or if travis docker config denies permissions on volumes somehow?

Due to the randomness of the bug, it's not likely.

Doesn't travis now have ssh capabilities so you can go in there and study what's going on?

Yup, on private projects, you can execute it in debug mode and indeed SSH in, I'll give it a try.

If you get in there, do a docker inspect ddev-<project>-web. That might tell us something about the volume mount or lack thereof. Also if you get in, take a look at the contents and permissions in /mnt/ddev-global-cache
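
The two checks suggested above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes the volume is named ddev-global-cache and mounted at /mnt/ddev-global-cache, as the rest of this thread indicates.

```shell
# Hypothetical helper names; only the docker subcommands and the
# ddev-<project>-web naming come from the discussion.

# Build the web container name for a given project.
web_container_name() {
  printf 'ddev-%s-web\n' "$1"
}

# Print the mounts of the project's web container as JSON.
show_web_mounts() {
  docker inspect --format '{{json .Mounts}}' "$(web_container_name "$1")"
}

# List numeric ownership and permissions inside the ddev-global-cache
# volume, using a throwaway busybox container.
show_global_cache() {
  docker run --rm -v ddev-global-cache:/mnt/ddev-global-cache busybox:latest \
    ls -lan /mnt/ddev-global-cache /mnt/ddev-global-cache/mkcert
}
```

For example, show_web_mounts foobar-foobar followed by show_global_cache would show whether the volume is mounted and who owns the mkcert directory.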

I'm happy to help debug this if it's recreatable and you're able to give me access. My pubkey is at https://www.randyfay.com/sites/default/files/rfay.pub_.txt

Best possible scenario is that we set up a trivial travisci project that can demonstrate this that isn't related to any actual work.

Thanks for the offer.
As I think about it more: we have 5+ client projects on DDEV using Travis, and only one fails. So it must be project-specific; however, over the long run, I see that it succeeds about 50% of the time.
I'll try to strip out all the project-specific parts and still reproduce it; I think I will find some time for it now.

Oh, it's possible you could have the mkcert problem described in https://github.com/FiloSottile/mkcert/pull/193 - I don't understand exactly how that would cause the problem inside the container but it's an interesting problem. We had the problem on CircleCI with Ubuntu 16.04, that's why it was studied and solved. But there the issue was running mkcert on the host/Ubuntu testbot.

There are built artifacts for that PR at https://github.com/rfay/mkcert/releases/tag/v1.4.1-alpha1

Well, we don't install mkcert inside Travis, so that rules out https://github.com/FiloSottile/mkcert/pull/193 , right? mkcert is not bundled with DDEV in any way, so if I don't install it manually inside Travis, it does not play any role, does it?

As I ported some bits from one project to another, two projects have this issue now.
I'll port the things into the public repository of https://github.com/Gizra/drupal-elm-starter , then hopefully I can link here the failed build.

mkcert has to be installed inside travis, because the CA has to be created. If you install with homebrew on linux it gets bundled. But it has to be there, and mkcert -install should certainly be run in your test process if you're using https. That said, I'm not sure any of this matters. But it will be interesting to sort it out.

@rfay
https://github.com/Gizra/drupal-elm-starter/pull/224
Here is the PR for the public repository. I'll let you know when we get the failure we are after there, but essentially that's the diff that caused issues for one of the projects.
I might need to update DDEV version there too, I'll see.

@rfay
Now we have caught this elusive bug.
Failed job:
https://travis-ci.org/Gizra/drupal-elm-starter/jobs/599102837

Successful job:
https://travis-ci.org/Gizra/drupal-elm-starter/jobs/599095306

The only difference between the two is in Markdown, so we can say the source code is 100% identical.

Source code:
https://github.com/Gizra/drupal-elm-starter/blob/8.x-travis-failure-trigger/ci-scripts/install_server.sh

DDEV config:
https://github.com/Gizra/drupal-elm-starter/blob/8.x-travis-failure-trigger/server/.ddev/config.yaml

I'll contact the support of Travis in the meantime, because I suspect that it might be their environment somehow as well.

Now I am experimenting with https://github.com/Gizra/drupal-elm-starter/pull/225 to find out whether, within one specific Travis job, the error happens again and again or only once.

Travis support gave up:

Thank you for following up, and we'll be happy to help. Regarding your inquiry on the location of our servers, our build servers run on GCP infrastructure and are in us-central1 region, so in the US. I had an opportunity to look at the diff checker my colleague shared again and couldn't find any recent changes on our side that could potentially cause this. Have you had an opportunity to run a debug build to investigate further? With access to the VM and the intimate understanding of the code base, you are in the best position to troubleshoot this issue further. The debug mode is explained in https://docs.travis-ci.com/user/running-build-in-debug-mode/

(For the record, public repositories do not have a debug mode, but private ones do.)

If you want to get together and troubleshoot this together in person this afternoon, we could do that. It could be fun. I think we'll be back from the day's activities by 3pm or 4pm.

Are you able to do this predictably now, have a repeatable situation?

@rfay No way to do this predictably unfortunately.
Now I have an idea: I'll try to break it in the same way on localhost,
for example by removing permissions on that folder after starting the process, and see whether it results in the same error.

More news on this.
I added this to our private projects:
./install -y || ./install -y || ddev logs

And after one failure, at the 2nd attempt, DDEV just installs fine. So we might say it's indeed related to DDEV.
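
The double-install workaround can be written as a small retry helper. This is a minimal sketch, not anything DDEV provides: retry is a hypothetical function name, and ./install is the project's own script from the comment above.

```shell
# retry N CMD...: run CMD up to N times, stopping at the first success.
# Returns 0 on success, 1 if every attempt failed.
retry() {
  attempts=$1
  shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $n of $attempts failed: $*" >&2
    n=$((n + 1))
  done
  return 1
}
```

With it, the line above becomes retry 2 ./install -y || ddev logs -s web. It masks the intermittent failure rather than fixing it, so it is mainly useful for confirming that a second attempt succeeds.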

What does "install" do?

I do have lots of experience with docker in test runners on CircleCI, and it does fail periodically for reasons I don't understand.

But your failure is still the permissions error, "open /mnt/ddev-global-cache/mkcert/rootCA-key.pem: permission denied", right? That would be different from anything I've seen.

Is this still the exact same error?

  • ddev-dbserver does sudo chown -R "$(id -u):$(id -g)" /mnt/ddev-global-cache/ ~/.my.cnf during startup.
  • ddev-webserver does sudo chown -R "$(id -u):$(id -g)" /mnt/ddev-global-cache/ ~/{.ssh*,.drush,.gitconfig,.my.cnf} during startup
  • On start, if the router is not running, ddev uses a dummy busybox container to push the root CA into the router: _, out, err := dockerutil.RunSimpleContainer("busybox:latest", "", []string{"sh", "-c", "mkdir -p /mnt/ddev-global-cache/composer && mkdir -p /mnt/ddev-global-cache/mkcert && chmod 777 /mnt/ddev-global-cache/* && cp -R /mnt/mkcert /mnt/ddev-global-cache"}, []string{}, []string{}, []string{"ddev-global-cache" + ":/mnt/ddev-global-cache", caRoot + ":/mnt/mkcert"}, "", true)

It looks like the failure you're seeing happens during the webserver container startup, which should be after ddev uses busybox to push the certs. But I note that ddev does not chmod or chown the root CA (global-cache/mkcert) after pushing them.

Thanks! About ./install
https://github.com/Gizra/drupal-elm-starter/blob/8.x-travis-failure-trigger/server/scripts/helper-functions.sh#L186

It does a ddev remove && ddev start cycle to have a clean state and configures a few things here and there, invokes composer, etc..

Just a couple of notes on that, probably not relevant:

  • ddev poweroff would be better than the "remove" approach, because it makes sure that all containers have been killed off. (And rm/remove was deprecated a long time ago.)
  • I recommend ddev poweroff unconditionally for the same reason, doing "remove" just because docker-compose.yaml is discovered isn't as good.

Is it possible for containers to be reused after already having run ddev? In that case, you could have a ddev-router that was alive (or stopped/paused) and that could affect your situation.

Our tests also do an unconditional ddev stop -RO <projectname> to make sure that not just containers but volumes are killed off.
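
A reset step combining the two commands mentioned above might look like this. It is a sketch only: reset_ddev is a hypothetical function name, and it assumes ddev is on the PATH of the test runner.

```shell
# reset_ddev PROJECT: stop every ddev container, then remove the named
# project's containers and volumes (-R) without taking a snapshot (-O).
reset_ddev() {
  project=$1
  ddev poweroff && ddev stop -RO "$project"
}
```

Calling reset_ddev foobar-foobar at the start of a CI script would give each run a clean state regardless of what an earlier run left behind.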

What uid do things run as on the test runner? root? ddev should refuse to run as root...

If your original test-runner side mkcert -install is done as root, then the resulting root CA being pushed by ddev into the docker volume will have root ownership, and that could cause this. I think that's likely what's going on.
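
One way to test that hypothesis on the runner is to look at who owns the host-side CA directory. The helpers below are hypothetical and assume GNU stat, as found on the Ubuntu images discussed here; mkcert -CAROOT prints the directory where mkcert stores its CA files.

```shell
# path_owner_uid PATH: print the numeric uid that owns PATH (GNU stat).
path_owner_uid() {
  stat -c '%u' "$1"
}

# warn_if_root_owned PATH: warn and return 1 when PATH is owned by root.
warn_if_root_owned() {
  if [ "$(path_owner_uid "$1")" -eq 0 ]; then
    echo "warning: $1 is owned by root" >&2
    return 1
  fi
  return 0
}
```

On a host where mkcert is installed, warn_if_root_owned "$(mkcert -CAROOT)" would flag a root-owned CA directory before ddev copies it into the volume.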

If your original test-runner side mkcert -install is done as root

We skip that step and use http:// for reaching the site.

OK, I'll change the way we reset the environment, as you described.

So mkcert is not even installed inside the Travis container / VM.

Do you always see this warning in the output: ("mkcert may not be properly installed, we suggest installing it for trusted https support, brew install mkcert nss, choco install -y mkcert, etc. and then mkcert -install")

@rfay Yep, that's a constant part of the output. I ignored it for the reason that we do not need HTTPS inside Travis.

It looks to me like this stanza:

# This will install the certs from $CAROOT (/mnt/ddev-global-cache/mkcert)
mkcert -install

# VIRTUAL_HOST is a comma-delimited set of fqdns, convert it to space-separated and mkcert
sudo CAROOT=$CAROOT mkcert -cert-file /etc/ssl/certs/master.crt -key-file /etc/ssl/certs/master.key ${VIRTUAL_HOST//,/ } localhost 127.0.0.1 ${DOCKER_IP} web ddev-${DDEV_PROJECT:-}-web ddev-${DDEV_PROJECT:-}-web.ddev_default && sudo chown $UID /etc/ssl/certs/master.*

should sudo mkdir -p $CAROOT && sudo chown -R ... && sudo chmod before running the mkcert -install.
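
The preparation step proposed here could look roughly like the sketch below. It is not the actual ddev-webserver change: ensure_caroot and the SUDO variable are hypothetical, the chown target mirrors the "$(id -u):$(id -g)" pattern the containers already use at startup, and the u+rwX mode is an assumption, since the comment above does not specify one.

```shell
# ensure_caroot DIR: create DIR if needed, then make it owned and
# writable by the current user. Set SUDO="" to run without sudo.
ensure_caroot() {
  dir=$1
  ${SUDO-sudo} mkdir -p "$dir"
  ${SUDO-sudo} chown -R "$(id -u):$(id -g)" "$dir"
  # Assumed mode: owner may read and write files and traverse directories.
  ${SUDO-sudo} chmod -R u+rwX "$dir"
}
```

In the stanza quoted above, ensure_caroot "$CAROOT" would run just before mkcert -install, so that writing rootCA-key.pem no longer depends on how the directory was created.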

I think your problem would go away if you installed mkcert on test runners and ran mkcert -install

@rfay I'll remove that double-install workaround soon, then, and do the mkcert install. Anyway, it's better to test with HTTPS: browser-based tests could catch mixed-content anomalies that would happen in production, etc.

Another way to experiment is to try #1848 just by using webimage: drud/ddev-webserver:20191025_mkcert_creation but that's a fairly poor approach for an intermittent problem, whereas adding mkcert -install on the host is likely to fix it permanently with no side-effects or unfortunate dead code left lying around.

I have no idea why this problem would be intermittent though. Is the test runner machine/VM created each time from scratch? Or does it get reused?

@rfay It's created each time from scratch. There's a minimal caching that Travis CI uses, but that mounted folder is certainly not included.

:clap:

I pulled the PR that claims to "fix" this, but please follow up and let's see what happens with it. It seems to me the PR does no harm.

Just a late follow-up here: now that we have mkcert inside Travis (and use HTTPS for the frontend testing), we have not seen this random anomaly at all.

I was just thinking about you and wondering if it got solved. Great to hear the good news.

Actually, we have the exact same problem: 1 out of 3 builds fails on CircleCI with that error. It has been happening since we upgraded to the latest ddev.

There must be some race condition with permissions or something similar, as it's not happening every time.

log output:

+ sudo chown -R 1001:1002 /mnt/ddev-global-cache/ /home/circleci/.ssh /home/circleci/.ssh-agent /home/circleci/.drush /home/circleci/.gitconfig /home/circleci/.my.cnf
+ '[' -d /mnt/ddev_config/homeadditions ']'
+ cp -r /mnt/ddev_config/homeadditions/. /home/circleci/
+ mkcert -install
ERROR: failed to save CA key: open /mnt/ddev-global-cache/mkcert/rootCA-key.pem: permission denied

@rthideaway Well, one immediate solution is to install mkcert on the host and invoke it; that solves the problem robustly and unblocks your CI environment.
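
As a sketch of that CI step (install_host_ca is a hypothetical function name, and how mkcert gets onto the runner is left to the CI configuration):

```shell
# install_host_ca: create and trust a local CA on the CI host.
# Fails early with a hint when mkcert is not available.
install_host_ca() {
  if ! command -v mkcert >/dev/null 2>&1; then
    echo "mkcert not found; install it first (e.g. brew install mkcert nss)" >&2
    return 1
  fi
  mkcert -install
}
```

Running install_host_ca before ddev start means the CA already exists on the host when ddev pushes it into the ddev-global-cache volume.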

In addition to just running mkcert -install, you might consider using ddev 1.12.0-alpha1, which got the fix we believe we discovered.

Seems like installing mkcert on the host machine and invoking it with mkcert -install solved the issue.
Let's see if the builds are stable now; if not, we'll try 1.12.0-alpha1.
