Describe the bug
[...]
osd2 # [ 5.294516] systemd[1]: Starting Create Volatile Files and Directories...
monA # [ 3.796676] systemd[1]: Reached target Containers.
osd0 # [ 6.469903] systemd[1]: Found device /dev/hvc0.
osd2 # [ 5.307825] systemd[1]: Started Create Volatile Files and Directories.
(0.67 seconds)
monA: waiting for success: ceph osd stat | grep -e '3 osds: 3 up[^,]*, 3 in'
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Thread 0x00007ffff64c4700 (most recent call first):
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output
Thread 0x00007ffff5cc3700 (most recent call first):
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output
Thread 0x00007ffff54c2700 (most recent call first):
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output
Thread 0x00007ffff4cc1700 (most recent call first):
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output
Current thread 0x00007ffff7893f40 (most recent call first):
(0.79 seconds)
monA: waiting for success: ceph -s | grep 'mgr: a(active,'
(0.79 seconds)
monA: waiting for success: ceph -s | grep 'HEALTH_OK'
/nix/store/vilh5ays3ymz3xkwk0fri2a70lha7pfc-stdenv-linux/setup: line 1271: 3 Aborted (core dumped) LOGFILE=$out/log.xml tests='exec(os.environ["testScript"])' /nix/store/wkmmx61nyr0ndh8kh1vpyh8l6csc5jp9-nixos-test-driver-basic-multi-node-ceph-cluster/bin/nixos-test-driver
builder for '/nix/store/33jy4sdn9jzxkmrdlidr6pzgxilanh3y-vm-test-run-basic-multi-node-ceph-cluster.drv' failed with exit code 134
error: build of '/nix/store/33jy4sdn9jzxkmrdlidr6pzgxilanh3y-vm-test-run-basic-multi-node-ceph-cluster.drv', '/nix/store/5p6zrdni0q24p2qwmxmjdx3n72mwc9yk-vm-test-run-basic-single-node-ceph-cluster.drv' failed
To Reproduce
This one is hard to reproduce. The above snippet is from nix-build nixos/tests/ceph-single-node.nix on a feature commit on top of release-20.03 e5cb3ef6a13057e5431323e3d78ffc0a98c7b88e
I know @aanderse has seen it happen in other python tests.
I also saw this in various tests.
As explained in https://github.com/NixOS/nixpkgs/pull/86486#issuecomment-626163102, our method of writing to stderr isn't thread-safe. I gave moving all this to the native python logging module, which should be threadsafe in my wip-test-driver-logging branch.
@flokli Your patch works well! It fixes a test that was failing consistently in a private code base. Could you make a PR?
nitpick: I noticed it's possible to do logger.getLogger() to get the root logger. Not sure if that's equal to logger.getLogger("") but at least for configuring the handler I think it's best to use logger.getLogger(). Maybe I don't know about some python conventions though :)
@roberth This was mostly a quick hack to see if it fixes things :-)
I'm still waiting on some reaction to https://github.com/NixOS/nixpkgs/pull/87191#issuecomment-630491258. Knowing whether log outputs will still be part of the output or not, and if there will be some structured/nested format will affect that code.
@flokli Perhaps it makes sense to add the logging code to #86486 which also needs it
Hm, that's yet another instance of "general flakyness". I downgraded the referenced PR to a draft.
I'd really like to see the necessary discussion to happen, before putting more time into this.
Well darn, I'd been redirecting stdout to stderr to see test progress in real time, but with machine.succeed("... >&2") now, this leads to hung tests.
This is easily reproducible with the test below. With 1000 iterations, it passes. With 2000 iterations, it produces the could not acquire lock error (though adding a time.sleep(1) at the end of the script eliminates the problem). With 3000 iterations or anything higher, regardless of sleeps, the test hangs.
let
usePython = true;
driver =
if usePython
then <nixpkgs/nixos/tests/make-test-python.nix>
else <nixpkgs/nixos/tests/make-test.nix>;
test = import driver {
name = "nixos-log-flooding-test";
machine = { pkgs, ... }: {
environment.systemPackages = [
(pkgs.writeScriptBin "do-test" ''
seq 2000 | while read N; do
echo "************************************************** $N"
done
'')
];
};
testScript = if usePython then ''
machine.succeed("do-test >&2")
'' else ''
$machine->succeed("do-test >&2");
'';
};
in test
Just stumbled upon this as well. If we don't fix this before 20.09 we shouldn't drop the perl driver IMHO.
As expressed in https://github.com/NixOS/nixpkgs/pull/86486#issuecomment-632869756, I'd like to get some feedback on how to treat "test results" - see https://github.com/NixOS/nixpkgs/pull/87191#issuecomment-626136749 before putting more time into https://github.com/NixOS/nixpkgs/pull/86486.
Sorry, should've read the entire thread more closely. Will take a look.
FYI #86486 has been closed due to a lack of time to get this finished. It would be awesome if anyone could finish this! :)
The logging has been moved to use pythons logging module - this should have been fixed by https://github.com/NixOS/nixpkgs/pull/96254.