Tox: detox worked, `tox -p auto` hangs with `trial -j` tests

Created on 10 Mar 2019 · 21 Comments · Source: tox-dev/tox

I've been happily running my test suite in detox for the past year or so. Today I upgraded to tox with the --parallel option, and suddenly my unit-tests environment hangs, despite mypy, lint, and integration-tests (which uses the same test runner and similar options) all passing and exiting nicely.

Here are some hopefully relevant details:

  • The test runner in question is trial -j 8, so the subprocess has subprocesses of its own, which may be confounding things.

  • unit-tests is the longest running environment.

  • When the tests hang, I see a --installpkg process as well as all of my trial worker processes hanging. As such, I tried --parallel--safe-build just in case, but it didn't change the behavior at all.

If submitting a BUG please provide:

  • [ ] Minimal reproducible example or detailed description, assign "bug"

Sorry to say that I can't produce a minimal reproducer; thus far I've only managed to produce this on a proprietary test suite.

  • [x] OS and pip list output

macOS 10.14.3

Package    Version
---------- -------
filelock   3.0.10
pip        18.1
pluggy     0.9.0
py         1.8.0
setuptools 40.5.0
six        1.12.0
toml       0.10.0
tox        3.7.0
virtualenv 16.4.3
wheel      0.32.2

All 21 comments

Also, when I hit Control-C after the tests hang, I see this error:

Traceback (most recent call last):
  File "/Users/glyph/.local/bin/tox", line 11, in <module>
    sys.exit(cmdline())
  File "/Users/glyph/.local/venvs/tox/lib/python3.6/site-packages/tox/session.py", line 47, in cmdline
    main(args)
  File "/Users/glyph/.local/venvs/tox/lib/python3.6/site-packages/tox/session.py", line 54, in main
    retcode = build_session(config).runcommand()
  File "/Users/glyph/.local/venvs/tox/lib/python3.6/site-packages/tox/session.py", line 467, in runcommand
    return self.subcommand_test()
  File "/Users/glyph/.local/venvs/tox/lib/python3.6/site-packages/tox/session.py", line 591, in subcommand_test
    retcode = self._summary()
  File "/Users/glyph/.local/venvs/tox/lib/python3.6/site-packages/tox/session.py", line 739, in _summary
    status = venv.status
AttributeError: 'VirtualEnv' object has no attribute 'status'
^CException ignored in: <module 'threading' from '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py'>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 1294, in _shutdown
    t.join()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

Happy to provide additional details if I can.

Do a run with the live output enabled and triple verbosity and submit the output like that too. Thanks!

@gaborbernat I just finished my first run with live output (I assume you mean the -o option?) The problem doesn't occur in that case -- so I guess I've found a workaround at least!

I'm trying a run with -vvv now, but without -o, to see if I get any interesting information that way.

Yes, -o. I'm not any closer to reproducing the issue, though ☹️

@gaborbernat I suspect that this is an infelicitous interaction between whatever trial is doing to manage its subprocesses and whatever tox is doing to manage its own. Is there anything you can think of that I could look for in the implementation of trial, or things we might be doing in our test suite, which might trigger this behavior?

I'm not familiar with trial at all at the moment, so can't think of anything now.

I'm asking more from the perspective of things that might tickle bugs in tox - I wouldn't expect you to dive too deeply into trial to understand it before we have at least a vague idea of what's going on here :).

I can't think of anything.

After repeated testing, two additional facts to report:

  • this time integration-tests hung as well
  • passing -vvv resulted in only this one additional interesting line of output when I hit control-C:
cleanup /Users/glyph/Projects/${NAME_OF_PROJECT}/.tox/.tmp/package/1/${NAME_OF_PACKAGE}-${VERSION}.zip

I'll do a few more runs with -o to see if I can reproduce it, if the hang is intermittent...

@glyph is the project in question open source so we could poke at it?


Also (unrelated)

(copying from code-quality mailing list)

@sigmavirus24

Hi all,

I noticed https://github.com/tox-dev/tox/issues/1183 and I suspect the
problem is in how tox and trial are using sub-processes for
parallel work. I know that Flake8 uses sub-processes as well (via
multiprocessing) so I'd be unsurprised if this eventually shows up in
Flake8's issue tracker. I don't recall how pylint does parallel
processing, but if it's anything like Flake8, I'm guessing they might
see it soon too for large enough code-bases.

I wanted to give y'all a heads up in case you notice or get reports about this.

Cheers,
Ian

https://github.com/tox-dev/tox/pull/1186 might fix this as a side effect - let's see once it gets merged 👍

@asottile Glyph mentioned this was a proprietary codebase in the description of the bug

oh dangit, I missed that 🤦‍♂️

I've also been experiencing a similar situation in an (unfortunately!) proprietary codebase, except the offending testrunner is pytest (with pytest-xdist running parallel child workers). I have had success with -o as suggested above, however.

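The problem is the use of subprocess.PIPE + proc.wait()

Once the child has written enough output to fill the pipe buffer, it blocks on its next write. Nothing reads the pipe while proc.wait() is waiting, so both sides hang indefinitely.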

https://github.com/tox-dev/tox/blob/d17fd1ee5622b964d72c1b3467d209a89274e23f/src/tox/session/commands/run/parallel.py#L48
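
For anyone who wants to see the same pattern outside of tox, here is a rough Python sketch. It is not the code from the linked file; the names `CHILD` and `wait_without_draining` are made up for illustration, and the 1 MiB output size is just an assumption chosen to be well above typical OS pipe buffers. A timeout is used so the demonstration gives up instead of hanging forever.

```python
import subprocess
import sys

# Hypothetical child: writes about 1 MiB to stdout, far more than a
# typical OS pipe buffer can hold.
CHILD = "import sys; sys.stdout.write('x' * (1 << 20))"


def wait_without_draining(timeout):
    """Return True if wait() timed out because the child blocked on a full pipe."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    try:
        # Nobody reads proc.stdout while we wait, so the child stalls
        # as soon as the pipe buffer is full.
        proc.wait(timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        # Clean up the stuck child and the unread pipe.
        proc.kill()
        proc.stdout.close()
        proc.wait()


if __name__ == "__main__":
    print("hung:", wait_without_draining(timeout=3.0))
```

With a small enough output the child would finish before the buffer fills, which may be why short-running or quiet environments never show the hang.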

Here's a minimal reproduction:

[tox]
envlist = e1,e2
skipsdist = true

[testenv:e1]
commands =
    python -c '[print("hello world") for _ in range(5000)]'

[testenv:e2]
commands =
    python -c '[print("hello world") for _ in range(5000)]'

The usual fix is to not use PIPE but to write to a temporary file and then read the file when completed
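
A minimal sketch of that approach in Python follows. This is only an illustration of the general technique, not the change that landed in tox; `run_capturing_to_file` and `CHILD` are hypothetical names, and merging stderr into stdout is an assumption made to keep the example short.

```python
import subprocess
import sys
import tempfile

# Hypothetical child: writes about 1 MiB to stdout, enough to fill a pipe.
CHILD = "import sys; sys.stdout.write('x' * (1 << 20))"


def run_capturing_to_file(cmd):
    """Run cmd with stdout and stderr sent to a temp file.

    Returns (returncode, captured output as bytes).
    """
    with tempfile.TemporaryFile() as out:
        proc = subprocess.Popen(cmd, stdout=out, stderr=subprocess.STDOUT)
        # Waiting is safe here: a regular file does not have a small
        # fixed-size buffer that blocks the writer the way a pipe does.
        returncode = proc.wait()
        # The child shares the file offset, so rewind before reading.
        out.seek(0)
        return returncode, out.read()


if __name__ == "__main__":
    code, output = run_capturing_to_file([sys.executable, "-c", CHILD])
    print(code, len(output))
```

Another common option is `proc.communicate()`, which drains the pipe while waiting, but the temp file keeps memory use flat for very chatty test runs.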

Please try out the branch in #1202 and see if it fixes your issues!

Hooray for open source!

Since this is now on master, I am trying 7a084a1 with tox -p auto.

It worked!
