Lightning: "make check" is slow

Created on 23 Jul 2018 · 9 Comments · Source: ElementsProject/lightning

Issue and Steps to Reproduce

Run make check or even make -j.. check. Observe how long it takes to finish the tests.

Notes

I don't think it's broken, but the way it is now, it's really annoying.

I have a 12-core system, but the Python tests seem to be mostly single-threaded. If we could run different tests in parallel on different cores, that could really speed up the test suite. Of course, care must be taken that tests don't interfere with each other, e.g. on the file system or in binding to port numbers.

Are there any ideas on how to do this?
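The interference concern above (shared directories and clashing ports) can be sketched in a few lines. This is only an illustration, not the project's test fixture code: `reserve_free_port` and `make_test_sandbox` are hypothetical helper names, and the port approach has a small race window between releasing the port and a daemon binding to it.

```python
import socket
import tempfile
from pathlib import Path


def reserve_free_port(host="127.0.0.1"):
    """Ask the kernel for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        # The port is released when the socket closes, so another process
        # could in principle grab it before the test daemon binds to it.
        return s.getsockname()[1]


def make_test_sandbox(base_dir, test_name):
    """Create a fresh, uniquely named directory for one test."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    # mkdtemp appends a random suffix, so two workers running the same
    # test name still get different directories.
    return Path(tempfile.mkdtemp(prefix=test_name + "-", dir=str(base)))
```

Each parallel worker would call both helpers per test, so no two tests share a data directory or a listening port.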

testing

bitonic-cjp

Most helpful comment

With 2 cores I regularly use PYTEST_PAR=10 since we spend a lot of time just waiting for timers to time out, or blocks to be propagated. Disabling valgrind is another important one, and should be safe since we run it on travis on every PR.

cdecker on 24 Jul 2018

👍2
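The point about waiting on timers explains why a worker count above the core count can still help: tests that mostly sleep do not compete for CPU. A minimal sketch of that effect, assuming stand-in "tests" that only sleep and using threads for simplicity (pytest-xdist itself uses separate worker processes):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_test(wait_seconds):
    """Stand-in for a test that mostly waits (timers, block propagation)."""
    time.sleep(wait_seconds)
    return wait_seconds


def run_all(waits, workers):
    """Run the stand-in tests on `workers` workers; return elapsed seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_test, waits))
    return time.monotonic() - start


if __name__ == "__main__":
    waits = [0.1] * 10
    print("1 worker:  ", round(run_all(waits, 1), 2), "s")
    print("10 workers:", round(run_all(waits, 10), 2), "s")
```

With one worker the waits add up; with ten they overlap, regardless of how many cores the machine has. Real tests also use CPU and memory, so the gain is smaller in practice.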

All 9 comments

We just updated the testing section of docs/HACKING.md to address this (#1725). Basically once you install:
pip3 install pytest-xdist

you can then run something like:
make -j12 check PYTEST_PAR=24 DEVELOPER=1 VALGRIND=0

Adjust PYTEST_PAR for your hardware. My testing shows it's mostly memory-dependent.

wythe on 23 Jul 2018

👍2

cdecker on 24 Jul 2018

👍2

Some tests fail when run in parallel but pass when run individually. For example,
make -j6 check PYTEST_PAR=12 DEVELOPER=1 VALGRIND=0
results in:

[gw4] [ 28%] FAILED tests/test_lightningd.py::LightningDTests::test_closing_while_disconnected 

===================================================================================================== FAILURES ======================================================================================================
__________________________________________________________________________________ LightningDTests.test_closing_while_disconnected __________________________________________________________________________________
[gw4] linux -- Python 3.5.3 /usr/bin/python3

But make -j12 check PYTEST_PAR=1 DEVELOPER=1 VALGRIND=0 passes all tests:

===================================================================================== 111 passed, 1 skipped in 2350.97 seconds ======================================================================================

Running the test separately also passes:

DEVELOPER=1 VALGRIND=1 PYTHONPATH=contrib/pylightning python3 tests/test_lightningd.py -f LightningDTests.test_closing_while_disconnected
test_closing_while_disconnected (__main__.LightningDTests) ... /home/simon/.local/lib/python3.5/site-packages/ephemeral_port_reserve.py:47: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 33233), raddr=('127.0.0.1', 49782)>
  s.accept()
/home/simon/.local/lib/python3.5/site-packages/ephemeral_port_reserve.py:47: ResourceWarning: unclosed <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 43307), raddr=('127.0.0.1', 59276)>
  s.accept()
/home/simon/.local/lib/python3.5/site-packages/ephemeral_port_reserve.py:47: ResourceWarning: unclosed <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 42253), raddr=('127.0.0.1', 47728)>
  s.accept()
ok

----------------------------------------------------------------------
Ran 1 test in 82.740s

OK

Another thing I noticed is that _after_ a (presumably failed) test, ps -ax | grep bitcoind still shows a long list of bitcoind processes.

SimonVrouwe on 26 Jul 2018
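The `ps -ax | grep bitcoind` check above can be scripted so it is easy to run after each test session. This is a sketch, assuming the usual five-column `ps -ax` layout (PID, TTY, STAT, TIME, COMMAND); `find_processes` and `lingering` are hypothetical helper names, not part of the test suite.

```python
import subprocess


def find_processes(ps_output, name):
    """Return (pid, command) pairs from `ps -ax` output whose command mentions `name`."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) == 5 and name in fields[4]:
            matches.append((int(fields[0]), fields[4]))
    return matches


def lingering(name="bitcoind"):
    """Run `ps -ax` and list the processes whose command line mentions `name`."""
    out = subprocess.check_output(["ps", "-ax"], universal_newlines=True)
    return find_processes(out, name)
```

Calling `lingering()` after a run returns an empty list when everything was cleaned up. Unlike the shell pipeline, it does not report the `grep` process itself.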

After calling pkill -f bitcoind and rerunning the tests with a different parameter, PYTEST_PAR=6,
make -j6 check PYTEST_PAR=6 DEVELOPER=1 VALGRIND=0 now fails on a _different_ test:

[gw2] [ 36%] FAILED tests/test_lightningd.py::LightningDTests::test_fundee_forget_funding_tx_unconfirmed 

===================================================================================================== FAILURES ======================================================================================================
_____________________________________________________________________________ LightningDTests.test_fundee_forget_funding_tx_unconfirmed _____________________________________________________________________________

Using make -j6 check PYTEST_PAR=3 DEVELOPER=1 VALGRIND=0 passes all tests, and ps -ax | grep bitcoind is now empty afterward. My system is a 4-core i5-7200U CPU @ 2.50GHz with 4GB RAM.

So I guess some tests fail when you set PYTEST_PAR larger than the number of cores?

SimonVrouwe on 26 Jul 2018

I mentioned using a PYTEST_PAR of 24, but that is on a 6-core i7 desktop with 64GB RAM. I still get the occasional intermittent failure, so I just rerun.

I think it has more to do with RAM size than core count. Swapping RAM leads to timeout errors.

The docs can be further improved to address these issues.

wythe on 26 Jul 2018
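If memory rather than core count is the limit, one way to pick a starting PYTEST_PAR is to cap it by how many workers fit in RAM. The sketch below is a rough heuristic, not something the project documents: `per_worker_bytes` is a hypothetical budget you would have to measure on your own machine, and `total_ram_bytes` relies on `sysconf` names that are available on Linux.

```python
import os


def suggest_pytest_par(total_ram_bytes, per_worker_bytes, max_workers):
    """Largest worker count whose estimated memory footprint fits in RAM."""
    by_memory = total_ram_bytes // per_worker_bytes
    # Never go below one worker, never above the caller's upper bound.
    return max(1, min(by_memory, max_workers))


def total_ram_bytes():
    """Physical RAM reported by sysconf (works on Linux; an assumption elsewhere)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

For example, `suggest_pytest_par(total_ram_bytes(), per_worker_bytes, 24)` gives an upper bound to try first. Intermittent timeouts would still be a reason to lower it.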

Setting PYTEST_PAR to a higher value definitely speeds things up, but my experience so far:

- setting PYTEST_PAR > 3 on my 4-core machine fails some tests
- some processes are not properly closed/killed/cleaned up after failed tests (bitcoind, valgrind, ...)
- ~~any failed test requires system reboot to pass eventually~~

my system: Debian stretch 9.4

EDIT: using PYTEST_PAR=5 PASSED, and a reboot is not always needed. So it remains a bit of a mystery to me. FWIW, 3 out of 4 test runs using different PYTEST_PAR values failed on this test:
test_fundee_forget_funding_tx_unconfirmed, so maybe that is the culprit.

SimonVrouwe on 30 Jul 2018

Since we have PYTEST_PAR, this can be closed, right?

jb55 on 31 Jul 2018

sure

SimonVrouwe on 31 Jul 2018

I'll also work on speeding these up a bit more :-)
Another trick is to run these on a ramdisk by setting TEST_DIR like this:

TEST_DIR=/dev/shm/ltest PYTEST_PAR=10 make pytest

On my machine this results in the following timings:

On the other hand, the ramdisk doesn't seem to do all that much...

cdecker on 1 Aug 2018
