Docker-mailserver: Tests sometimes fail at first step

Created on 24 Aug 2019 · 31Comments · Source: tomav/docker-mailserver

Failure at some builds:

not ok 1 checking default relay host: default relay host is added to main.cf

Context

Sometimes works, sometimes fails. Started after the refactoring by @martin-schulze-vireso (probably due to a shorter delay before starting the tests).

Expected Behavior

OK.

Actual Behavior

Not ok in some builds.

Possible Fix

Longer sleep or improve the readiness check.

Steps to Reproduce

Build many times.

Your Environment

Travis CI, see https://travis-ci.org/erik-wramner/docker-mailserver/builds/576171948. Unfortunately no details.

Source

erik-wramner

Most helpful comment

my problem when I want to merge a new PR is to decide if it has broken something or if the tests are green [...] As it is I have to login and press Rebuild until the build succeeds and it takes 20 minutes every time.

I am fully with you on that, just wanted to make sure that this stop gap solution does not become the permanent "fix" like the sleeps before.

martin-schulze-vireso on 6 Sep 2019

👍2

All 31 comments

Some observations: My previous trial to reproduce this only ran the first test file, not the whole suite.
I tried again to reproduce that with the full suite. I got the error once but no output. Then I ran them again at least 5 times and finally got to a state where each following run (I tried about 5 again) would fail before starting the actual tests with:

# Setup sieve & create filtering folder (INBOX/spam)
docker cp "`pwd`/test/config/sieve/dovecot.sieve" mail:/var/mail/localhost.localdomain/user1/.dovecot.sieve
Error: No such container:path: mail:/var/mail/localhost.localdomain/user1

This was yesterday, today the setup works again without a difference in the setup. I saw the error above for my first PR here, where the setup shell failed to start postfix due to an error in the setup. My hypothesis is: there is some external service that is queried during the setup process (e.g. ClamAV signature server) which has rate limiting and bans the test suite after too many runs.

Maybe something similar is happening in this case as well? Did you see the error on the first run for the day as well or only after a few runs? Did it persist after that?

martin-schulze-vireso on 25 Aug 2019

Not sure as Travis could make many builds for different branches/projects. It failed on the first Travis build for my project, but who knows what IP it is using? We could run it without the docker build step behind a firewall or without networking. That way we can exclude network connectivity if it works.

erik-wramner on 25 Aug 2019

The build was buffered in my case so I don't think that was the problem.

martin-schulze-vireso on 26 Aug 2019

I drilled deeper into my problem: For some reason the [email protected] entry in postfix-accounts.cf is missing in some runs, which in turn leads to the cp error.

EDIT 1
I nailed it down further: There seems to be concurrent access to the file. ~Even after removing a stale instance of one of the test containers which might have interfered I get really strange behavior:~ I changed the Makefile to contain following code:

generate-accounts:
    tail -F test/config/postfix-accounts.cf&
    docker run --rm -e [email protected] -e MAIL_PASS=mypassword -t $(NAME) /bin/sh -c 'echo "$$MAIL_USER|$$(doveadm pw -s SHA512-CRYPT -u $$MAIL_USER -p $$MAIL_PASS)"' > test/config/postfix-accounts.cf
    grep 'user1@localhost' test/config/postfix-accounts.cf
    exit 2

This gives me a very short feedback loop: The grep will fail when the first command messed up.

EDIT 2
I was not able to reproduce this anymore after the stale container was gone. Since this is a file that is watched and changed from within the containers via the check-for-changes.sh I am afraid that we get some unwanted interaction between concurrently active containers.

Since the modifications are doing some sort of normalization they will eventually settle down but during editing (e.g. in the generate-accounts target) they might interfere with the actual editing being done.

martin-schulze-vireso on 27 Aug 2019

@fbartels @martin-schulze-vireso I have two ideas here.

One I have already started on and implemented partially: lock the accounts file around code that changes it. That should work, though I'm no expert in locking semantics in shared docker volumes. The problem (except for possible docker issues) is that there are multiple similar files and so far I'm only locking the accounts file, plus only when it is changed in the container, not when there are direct outside modifications that are not using the setup command.

The other solution requires more work but is cleaner: never ever change the .cf files in the containers. When the files are normalized with sed, make the changes to a copy and not in-place in the original files. We could then mount the .cf files read only to make sure the containers behave. We would need to make an exception when we want to use setup for making changes, but we can do that in a specific container (already using mail I think) that mounts the files read/write.

Given how often the tests fail on these things perhaps it is worth implementing the second solution after all. I doubt that this is ever a problem in real life, where there is only a single container.

EDIT: a third approach would be to copy the config directory and give each container its own copy, which can then be mounted read/write. That may be less work and the containers can't affect each other. Still not as clean as mounting read only.

erik-wramner on 27 Aug 2019

I also thought about mounting readonly or copying the folder. Using a copy per container might be a good short term solution and a fast way to try if it fixes the problem. In the long run I find the separation into the user input file and a normalized copy cleaner. You never know what the user is doing with their file so they might have some automatic script as well.

martin-schulze-vireso on 27 Aug 2019

Yes, the third approach sounds like a good idea (least amount of work, should be easy to verify). I'd imagine that you could create the config folder for these images mit mktemp, which will also mean that tests should have the same outcome when running repeatedly.

fbartels on 27 Aug 2019

While mktemp is nice it should be good enough to create tests/config/ and make one directory per container that is created when we start the container and dropped afterwards? That is a bit easier to work with as it is immediately obvious what container a given config directory is used with. The names produced by mktemp are not very friendly. If all containers always have exactly the same config that is not needed, but I suspect it can vary a bit.

erik-wramner on 27 Aug 2019

How should we proceed? I think moving the remaining containers' creation from the Makefile into .bats files would allow for a single bash function that does the per container setup and teardown. However, this would mean we have to do the move first.
I am currently working on another batch of these moves but I am stuck at mail_with_ldap which suddently fails with the email being present twice in the maildir. I am not sure how fast this will progress and if we should wait for it.

martin-schulze-vireso on 5 Sep 2019

I think it would be cleaner. However, the build is quite unstable lately. I added a commit right now that may help a bit. I'm waiting for another log message ("up and running") and I have increased the maximum wait time from 60 to 120 seconds. Not sure if it makes a difference. Anyway, take your time, but if you feel something is ready for prime time, push it!

erik-wramner on 5 Sep 2019

Just to avoid misunderstanding, by 'it would be cleaner' you mean it would be cleaner to first move the containers into individual .bats files and then have a common setup function?

'm waiting for another log message ("up and running") and I have increased the maximum wait time from 60 to 120 seconds.

I'd hope that both changes are not necessary: The intention of this command was to wait for the configuration files to be final, not for the service to be running. Furthermore, if the timeout being too short was the problem, there should be a clear error message stating this.

martin-schulze-vireso on 5 Sep 2019

Yes, that is what I meant.

As for the changes being necessary or not I still don't know, but at least the three builds after the change were green. Most of the builds immediately before the change were red, failing either at the first step (relay in main.cf) or at the login with added user. Could be coincidence though.

erik-wramner on 6 Sep 2019

If (and that is a big if) the problems stem from the concurrent changes to the configuration files then waiting longer, e.g. for the later log message, will "fix" the problem. The Makefile's gratuitous use of sleeps probably also tried to avoid that. Still I would feel more comfortable if we were able to definitely pin down the problem by reproducing it deterministically to get a verified fix. Even if we do not see the problem anymore after applying your solution above or by using separate config directories per container, we cannot be sure that this really solved the problem or if slight timing changes just shifted the probability of it happening.

IMHO the shift to a later log message just achieves the same brittle "success" that the longer sleeps did.

martin-schulze-vireso on 6 Sep 2019

You're right, it is a brittle success. However, my problem when I want to merge a new PR is to decide if it has broken something or if the tests are green. If the tests succeed most of the time that is much easier than if they fail most of the time. As it is I have to login and press Rebuild until the build succeeds and it takes 20 minutes every time.

But yes, when we find and fix the real problem we should shorten or get rid of the arbitrary waits.

erik-wramner on 6 Sep 2019

👍1

my problem when I want to merge a new PR is to decide if it has broken something or if the tests are green [...] As it is I have to login and press Rebuild until the build succeeds and it takes 20 minutes every time.

I am fully with you on that, just wanted to make sure that this stop gap solution does not become the permanent "fix" like the sleeps before.

martin-schulze-vireso on 6 Sep 2019

👍2

Do you have any idea how we can verify that it is indeed the concurrent configs? My main problem is that I can't even reproduce the issue locally and I'd like to have a shorter loop than say 5min to see if it occurred after I make a change (to debug).

martin-schulze-vireso on 6 Sep 2019

Not really, I'm a bit stumped. Another theory is that it could be something with the file cache and docker volumes, but I've not found anyone who has had issues with that in this way, so it is far-fetched. If we do separate the configs and/or make them read-only we can at least see if that helped.

erik-wramner on 6 Sep 2019

I got the problem with "default relay host is added to main.cf" locally and this time the problem was that there was no docker container named mail_with_default_relay. It must have failed for some reason, not clear why.

erik-wramner on 7 Sep 2019

@martin-schulze-vireso it seems at least one container remains stopped and not removed after a clean run. It is not named, so I don't know which one it is. I'll try to find it.

erik-wramner on 7 Sep 2019

IRC the container is stopped immediately if the start script fails for some reason. This might also happen with invalid configuration but it is just a guess. Furthermore, you have to be very careful with running locally: since the teardown_file functionality is just emulated and not really implemented, it will not run if you abort mid test, thus retaining the container. It seems we have a rather old bats-core implementation in use, the newer ones allow for running tests that match a pattern, in this case you could run bats -f last to run all teardown_file functions from all tests.

I am working on getting full support for teardown_file in bats-core but they seem to be stuck.

martin-schulze-vireso on 7 Sep 2019

@martin-schulze-vireso thanks for the explanation. I've changed cleanup a bit in #1254 to get rid of the anonymous containers.

erik-wramner on 7 Sep 2019

Okay, I think I've got a plan now: I will try to pin down the symptoms via Travis with some additional debug output, which should be "easy" given the 1 in 3 "success" rate and the given hypothesis.
When we got conclusive evidence of a stable error behavior (I'd say after ~5 fails so about 15 runs) I will try to reproduce it locally and then take some shortcuts that preserve this behavior to get a shorter feedback loop for debugging. When this succeeds we can venture a fix and try it again to see if it helped on Travis (again about 10 runs). This is no proof but I think the best we can do under the circumstances.

I hope to speed up the Travis rounds by parallelizing via fast paced pushing of dummy commits. Are you okay with spamming Travis in this way? (Alternatively, do you know of any other system were I can reproduce the problem? My local system has a far smaller error rate.)

martin-schulze-vireso on 9 Sep 2019

I'm ok with spamming Travis if we need to and apparently we do. One thing you could do if you like to speed things up is to create your own Travis account linked to your Github account. Then you can build the image in your fork without pushing it to the main repo, meaning you have full control over the builds. I'm doing that.

Another thought is that it fails much more often on Travis and that it kind of helps (fewer errors) with longer sleeps. Perhaps we should try to reproduce locally in a weaker environment? Run it in a VM that only has a single CPU, for example. Or burn CPU in some way on your normal machine while building.

Yesterday I had problems with the very first test (relay in main.cf missing), with the relay host tests (all of them would fail) and with the login after adding an account test. Those are good areas to focus on.

erik-wramner on 9 Sep 2019

I already found two issues when trying to set me up so here I go:

Invisible timeout failure

1259 solves a problem that I already suspected but forgot about: My timeout function failed silently so we did not recognize it as a timeout problem.

When starting clean locally, I noticed heavy load on my system. At one point I saw openssl generating DH params. Somehow this was much more prolonged on the first run than the following ones (even though I make cleaned thouroughly). Anyways, looking at the build you linked to @erik-wramner I now see the exact problem pattern that is solved in my PR above: the test fails without error output. I conclude that we ran into a timeout there, likely during the DH param creation.
Hence your timeout extension improved the success rate. I assume besides the obvious load factor we might also have luckier DH param generation runs that need less work.

Not everyone uses the postfix-accounts.cf lock

I took another hard look at the code in check-for-changes.sh because of my suspicions from above and was surprised to see the locking code that you put in just a month ago @erik-wramner. In theory this should prevent the exact problem that I suspect. However, there might be a caveat: I did not see any use of locks on the consuming side in start-server.sh so we still might have a race condition with that.

martin-schulze-vireso on 11 Sep 2019

Well, in retrospect it may have been stupid, but I intentionally didn't use the lock in start-server.sh because it always completes the steps where it uses the files before it starts the check for changes script. As long as there is only one container starting at a time there can be no conflicts. Of course in the tests we are spinning up multiple containers, so if they start in parallel we can run into problems. It also doesn't protect against "manual" changes where the file is modified without using the setup script and hence without getting the lock.

erik-wramner on 11 Sep 2019

@martin-schulze-vireso the tests should not generate dhparams, as they are mounted into the image to /tmp/docker-mailserver/dhparams.pem. If that doesn't work we need to fix it. I'm pretty sure it works for some tests, because the run time for the tests went down several minutes when I added that. Good that you spotted it!

erik-wramner on 11 Sep 2019

This is what I got in one run:

not ok 28 first
# (from function `repeat_until_success_or_timeout' in file test/test_helper/common.bash, line 22,
#  from function `wait_for_finished_setup_in_container' in file test/test_helper/common.bash, line 34,
#  from function `setup_file' in file test/mail_with_relays.bats, line 23,
#  from function `run_setup_file_if_necessary' in file test/test_helper/common.bash, line 43,
#  from function `setup' in test file test/mail_with_relays.bats, line 4)
#   `run_setup_file_if_necessary' failed
# 12f351ed2ed98272b172d64d3ab2c4c332535f280623f351cb9932f6dc6f52f5

So the container did not start on time.

erik-wramner on 11 Sep 2019

I tried to debug some more and had an eye on the straining processes. I saw dh param generation again several times. The first test seems to create them always even on repeated runs without cleaning in between. The observed runtimes ranged from 40-90s so this dances nicely around the 60s timeout mark.

There are two containers that don't mount test/config as /tmp/docker-mailserver: mail_with_relays and mail_with_default_relay (incidentally the one which is created in the first test!), they use test/config/relay-hosts instead, which does not contain a dhparams.pem.

EDIT
Okay, retaining the dhparam.pem seems to avoid the regeneration. I will add them to git just like the others.

martin-schulze-vireso on 11 Sep 2019

It is much more stable now! Very good news.

erik-wramner on 15 Sep 2019

Nice to hear. Still sounds like there is an occasional hiccup?

martin-schulze-vireso on 15 Sep 2019

Well, the "login with added account" test failed once, but overall most of the false alarms seem to be gone. I'll close this one for now.

erik-wramner on 16 Sep 2019

Was this page helpful?

0 / 5 - 0 ratings