Nixpkgs: NixOS python tests intermittently fail with Fatal Python error

Created on 5 May 2020  路  13Comments  路  Source: NixOS/nixpkgs

Describe the bug

[...]
osd2 # [    5.294516] systemd[1]: Starting Create Volatile Files and Directories...
monA # [    3.796676] systemd[1]: Reached target Containers.
osd0 # [    6.469903] systemd[1]: Found device /dev/hvc0.
osd2 # [    5.307825] systemd[1]: Started Create Volatile Files and Directories.
(0.67 seconds)
monA: waiting for success: ceph osd stat | grep -e '3 osds: 3 up[^,]*, 3 in'
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads

Thread 0x00007ffff64c4700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Thread 0x00007ffff5cc3700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Thread 0x00007ffff54c2700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Thread 0x00007ffff4cc1700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Current thread 0x00007ffff7893f40 (most recent call first):
(0.79 seconds)
monA: waiting for success: ceph -s | grep 'mgr: a(active,'
(0.79 seconds)
monA: waiting for success: ceph -s | grep 'HEALTH_OK'
/nix/store/vilh5ays3ymz3xkwk0fri2a70lha7pfc-stdenv-linux/setup: line 1271:     3 Aborted                 (core dumped) LOGFILE=$out/log.xml tests='exec(os.environ["testScript"])' /nix/store/wkmmx61nyr0ndh8kh1vpyh8l6csc5jp9-nixos-test-driver-basic-multi-node-ceph-cluster/bin/nixos-test-driver
builder for '/nix/store/33jy4sdn9jzxkmrdlidr6pzgxilanh3y-vm-test-run-basic-multi-node-ceph-cluster.drv' failed with exit code 134
error: build of '/nix/store/33jy4sdn9jzxkmrdlidr6pzgxilanh3y-vm-test-run-basic-multi-node-ceph-cluster.drv', '/nix/store/5p6zrdni0q24p2qwmxmjdx3n72mwc9yk-vm-test-run-basic-single-node-ceph-cluster.drv' failed

To Reproduce
This one is hard to reproduce. The above snippet is from nix-build nixos/tests/ceph-single-node.nix on a feature commit on top of release-20.03 e5cb3ef6a13057e5431323e3d78ffc0a98c7b88e

I know @aanderse has seen it happen in other python tests.

bug nixos testing

All 13 comments

I also saw this in various tests.

As explained in https://github.com/NixOS/nixpkgs/pull/86486#issuecomment-626163102, our method of writing to stderr isn't thread-safe. I gave moving all this to the native python logging module, which should be threadsafe in my wip-test-driver-logging branch.

@flokli Your patch works well! It fixes a test that was failing consistently in a private code base. Could you make a PR?

nitpick: I noticed it's possible to do logger.getLogger() to get the root logger. Not sure if that's equal to logger.getLogger("") but at least for configuring the handler I think it's best to use logger.getLogger(). Maybe I don't know about some python conventions though :)

@roberth This was mostly a quick hack to see if it fixes things :-)

I'm still waiting on some reaction to https://github.com/NixOS/nixpkgs/pull/87191#issuecomment-630491258. Knowing whether log outputs will still be part of the output or not, and if there will be some structured/nested format will affect that code.

@flokli Perhaps it makes sense to add the logging code to #86486 which also needs it

Hm, that's yet another instance of "general flakyness". I downgraded the referenced PR to a draft.

I'd really like to see the necessary discussion to happen, before putting more time into this.

Well darn, I'd been redirecting stdout to stderr to see test progress in real time, but with machine.succeed("... >&2") now, this leads to hung tests.

This is easily reproducible with the test below. With 1000 iterations, it passes. With 2000 iterations, it produces the could not acquire lock error (though adding a time.sleep(1) at the end of the script eliminates the problem). With 3000 iterations or anything higher, regardless of sleeps, the test hangs.

let
  usePython = true;

  driver =
    if usePython
    then <nixpkgs/nixos/tests/make-test-python.nix>
    else <nixpkgs/nixos/tests/make-test.nix>;

  test = import driver {
    name = "nixos-log-flooding-test";

    machine = { pkgs, ... }: {
      environment.systemPackages = [
        (pkgs.writeScriptBin "do-test" ''
          seq 2000 | while read N; do
            echo "************************************************** $N"
          done
        '')
      ];
    };

    testScript = if usePython then ''
      machine.succeed("do-test >&2")
    '' else ''
      $machine->succeed("do-test >&2");
    '';
  };
in test

Just stumbled upon this as well. If we don't fix this before 20.09 we shouldn't drop the perl driver IMHO.

As expressed in https://github.com/NixOS/nixpkgs/pull/86486#issuecomment-632869756, I'd like to get some feedback on how to treat "test results" - see https://github.com/NixOS/nixpkgs/pull/87191#issuecomment-626136749 before putting more time into https://github.com/NixOS/nixpkgs/pull/86486.

Sorry, should've read the entire thread more closely. Will take a look.

FYI #86486 has been closed due to a lack of time to get this finished. It would be awesome if anyone could finish this! :)

The logging has been moved to use pythons logging module - this should have been fixed by https://github.com/NixOS/nixpkgs/pull/96254.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

langston-barrett picture langston-barrett  路  3Comments

ayyess picture ayyess  路  3Comments

teto picture teto  路  3Comments

tomberek picture tomberek  路  3Comments

chris-martin picture chris-martin  路  3Comments