Tracking issue for latest test failures we noticed in recent PR.
Targets: K64F, NUCLEO_F429ZI, ARCH_PRO, NUCLEO_F207ZG
Compilers: all
First CI test job that had this failures is 3370. One of the latests test logs can be seen here https://github.com/ARMmbed/mbed-os/pull/8335#issuecomment-437728165
@mikisch81 was able to reproduce this on master. I suspected initially networking issue in our CI setup (@studavekar just in case).
@ARMmbed/mbed-os-ipcore Please review
[ ] Question
[ ] Enhancement
[X] Bug
cc @ARMmbed/mbed-os-maintainers
Reproduced it with latest master branch:
> git clone [email protected]:ARMmbed/mbed-os.git
> mbed test -m K64F -t GCC_ARM -n tests-network-interface
...
...
mbedgt: mbed-host-test-runner: started
mbedgt: checking for GCOV data...
mbedgt: test on hardware with target id: 0240000030514e45003520067d7e00351f91000097969900
mbedgt: test suite 'mbed-os-tests-network-interface' ................................................. FAIL in 30.68 sec
test case: 'NETWORKINTERFACE_CONN_DISC_REPEAT' ............................................... SKIPPED in 0.00 sec
test case: 'NETWORKINTERFACE_STATUS' ......................................................... FAIL in 15.09 sec
test case: 'NETWORKINTERFACE_STATUS_GET' ..................................................... SKIPPED in 0.00 sec
test case: 'NETWORKINTERFACE_STATUS_NONBLOCK' ................................................ SKIPPED in 0.00 sec
mbedgt: test case summary: 0 passes, 1 failure
mbedgt: utest test case summary mismatch: utest reported passes and failures miscount!
reported by utest: passes = 0, failures 1)
test case result count: passes = 0, failures 4)
mbedgt: all tests finished!
mbedgt: shuffle seed: 0.3196952986
mbedgt: test suite report:
+--------------+---------------+---------------------------------+--------+--------------------+-------------+
| target | platform_name | test suite | result | elapsed_time (sec) | copy_method |
+--------------+---------------+---------------------------------+--------+--------------------+-------------+
| K64F-GCC_ARM | K64F | mbed-os-tests-network-interface | FAIL | 30.68 | default |
+--------------+---------------+---------------------------------+--------+--------------------+-------------+
mbedgt: test suite results: 1 FAIL
mbedgt: test case report:
+--------------+---------------+---------------------------------+-----------------------------------+--------+--------+---------+--------------------+
| target | platform_name | test suite | test case | passed | failed | result | elapsed_time (sec) |
+--------------+---------------+---------------------------------+-----------------------------------+--------+--------+---------+--------------------+
| K64F-GCC_ARM | K64F | mbed-os-tests-network-interface | NETWORKINTERFACE_CONN_DISC_REPEAT | 0 | 0 | SKIPPED | 0.0 |
| K64F-GCC_ARM | K64F | mbed-os-tests-network-interface | NETWORKINTERFACE_STATUS | 0 | 1 | FAIL | 15.09 |
| K64F-GCC_ARM | K64F | mbed-os-tests-network-interface | NETWORKINTERFACE_STATUS_GET | 0 | 0 | SKIPPED | 0.0 |
| K64F-GCC_ARM | K64F | mbed-os-tests-network-interface | NETWORKINTERFACE_STATUS_NONBLOCK | 0 | 0 | SKIPPED | 0.0 |
+--------------+---------------+---------------------------------+-----------------------------------+--------+--------+---------+--------------------+
mbedgt: test case results: 1 FAIL / 3 SKIPPED
mbedgt: completed in 30.87 sec
mbedgt: exited with code 1
[mbed] ERROR: "mbedgt" returned error code 1.
[mbed] ERROR: Command "mbedgt --test-spec /home/micsch01/dev/spm/BUILD/tests/K64F/GCC_ARM/test_spec.json -n mbed-os-tests-network-interface" in "/home/micsch01/dev/spm"
@mtomczykmobica Noticed you were not part of ipcore, mention separately
Latest test results (completed few mins ago): https://github.com/ARMmbed/mbed-os/pull/7755#issuecomment-437896519
Update.
Going back to the merge commit generated when it was first merged into Mbed OS still fails.
git checkout fef02181
mbed test -m AUTO -t ARM -n tests-network-interface -c -vv
Good news: Master is fine.
Bad news: Austin CI networking is not.
@cmonr it's just CI or Austin network ?
Fyi, the first failure was seen here, but it looks like it was also aborted.
http://mbed-ci-master-2.austin.arm.com:8081/job/test_matrix_raas/3369/
@cmonr it's just CI or Austin network ?
Thinking about it, I don't know. I'd wager it's not just CI, since I was able to easily reproduce it locally.
Don't think it's either, since @0xc0170 was also able to reproduce the problem, but he's just a tad bit not in Austin.
Another update.
Re-ran tests while of the internal network, and the issue is still present. I suspect that ssomething external the the test depends on recently changed.
Internal Jira reference: https://jira.arm.com/browse/MBOCUSTRIA-160
Running verbosely shows an assert fails with -3010, a NSAPI_ERROR_DHCP_FAILURE.
<snip>
[mbed-10628] Detected "NUCLEO_F429ZI" connected to "E:" and using com port "COM23"
[mbed-10628] Exec "mbedgt --test-spec .\BUILD\tests\NUCLEO_F429ZI\ARM\test_spec.json -n tests-network-interface -V" in "C:\Users\root\i.8715\mbed-os"
mbedgt: greentea test automation tool ver. 1.4.0
mbedgt: test specification file '.\BUILD\tests\NUCLEO_F429ZI\ARM\test_spec.json' (specified with --test-spec option)
mbedgt: using '.\BUILD\tests\NUCLEO_F429ZI\ARM\test_spec.json' from current directory!
mbedgt: detecting connected mbed-enabled devices...
mbedgt: detected 1 device
+---------------+----------------------+-------------+-------------+--------------------------+
| platform_name | platform_name_unique | serial_port | mount_point | target_id |
+---------------+----------------------+-------------+-------------+--------------------------+
| NUCLEO_F429ZI | NUCLEO_F429ZI[0] | COM23 | E: | 07960221012567073E61F1AC |
+---------------+----------------------+-------------+-------------+--------------------------+
mbedgt: processing target 'NUCLEO_F429ZI' toolchain 'ARM' compatible platforms... (note: switch set to --parallel 1)
+---------------+----------------------+-------------+-------------+--------------------------+
| platform_name | platform_name_unique | serial_port | mount_point | target_id |
+---------------+----------------------+-------------+-------------+--------------------------+
| NUCLEO_F429ZI | NUCLEO_F429ZI[0] | COM23:9600 | E: | 07960221012567073E61F1AC |
+---------------+----------------------+-------------+-------------+--------------------------+
mbedgt: test case filter (specified with -n option)
tests-network-interface
test filtered in 'tests-network-interface'
mbedgt: running 1 test for platform 'NUCLEO_F429ZI' and toolchain 'ARM'
use 1 instance of execution threads for testing
mbedgt: checking for 'host_tests' directory above image directory structure
found 'host_tests' directory in: 'TESTS\host_tests'
mbedgt: selecting test case observer...
calling mbedhtrun: mbedhtrun -m NUCLEO_F429ZI -p COM23:9600 -f "BUILD/tests/NUCLEO_F429ZI/ARM/TESTS/network/interface/interface.bin" -e "TESTS\host_tests" -d E: -c default -t 07960221012567073E61F1AC -r default -C 4 --sync 5 -P 60
mbedgt: mbed-host-test-runner: started
[1542038331.71][HTST][INF] host test executor ver. 1.4.0
[1542038331.71][HTST][INF] copy image onto target...
[1542038331.71][COPY][INF] Waiting up to 60 sec for '07960221012567073E61F1AC' mount point (current is 'E:')...
1 file(s) copied.
[1542038341.74][HTST][INF] starting host test process...
[1542038342.21][CONN][INF] starting connection process...
[1542038342.21][CONN][INF] notify event queue about extra 60 sec timeout for serial port pooling
[1542038342.21][CONN][INF] initializing serial port listener...
[1542038342.21][PLGN][INF] Waiting up to 60 sec for '07960221012567073E61F1AC' serial port (current is 'COM23')...
[1542038342.23][SERI][INF] serial(port=COM23, baudrate=9600, read_timeout=0.01, write_timeout=5)
[1542038342.23][SERI][INF] reset device using 'default' plugin...
[1542038342.24][HTST][INF] setting timeout to: 60 sec
[1542038342.48][SERI][INF] waiting 1.00 sec after reset
[1542038343.48][SERI][INF] wait for it...
[1542038343.51][SERI][TXD] mbedmbedmbedmbedmbedmbedmbedmbedmbedmbed
[1542038343.51][CONN][INF] sending up to 5 __sync packets (specified with --sync=5)
[1542038343.51][CONN][INF] sending preamble 'b0eeac0a-7058-44bc-99ee-53feda14b828'
[1542038343.56][SERI][TXD] {{__sync;b0eeac0a-7058-44bc-99ee-53feda14b828}}
[1542038343.62][CONN][RXD] mbedmbedmbedmbedmbedmbedmbedmbed
[1542038343.67][CONN][INF] found SYNC in stream: {{__sync;b0eeac0a-7058-44bc-99ee-53feda14b828}} it is #0 sent, queued...
[1542038343.68][HTST][INF] sync KV found, uuid=b0eeac0a-7058-44bc-99ee-53feda14b828, timestamp=1542038343.666000
[1542038343.68][CONN][INF] found KV pair in stream: {{__version;1.3.0}}, queued...
[1542038343.70][HTST][INF] DUT greentea-client version: 1.3.0
[1542038343.71][CONN][INF] found KV pair in stream: {{__timeout;480}}, queued...
[1542038343.73][HTST][INF] setting timeout to: 480 sec
[1542038343.74][CONN][INF] found KV pair in stream: {{__host_test_name;default_auto}}, queued...
[1542038343.76][HTST][INF] host test class: '<class 'mbed_host_tests.host_tests.default_auto.DefaultAuto'>'
[1542038343.76][HTST][INF] host test setup() call...
[1542038343.76][HTST][INF] CALLBACKs updated
[1542038343.76][HTST][INF] host test detected: default_auto
[1542038343.78][CONN][RXD] >>> Running 4 test cases...
[1542038343.82][CONN][INF] found KV pair in stream: {{__testcase_name;NETWORKINTERFACE_STATUS}}, queued...
[1542038343.87][CONN][INF] found KV pair in stream: {{__testcase_name;NETWORKINTERFACE_STATUS_NONBLOCK}}, queued...
[1542038343.93][CONN][INF] found KV pair in stream: {{__testcase_name;NETWORKINTERFACE_STATUS_GET}}, queued...
[1542038343.98][CONN][RXD]
[1542038343.98][CONN][INF] found KV pair in stream: {{__testcase_name;NETWORKINTERFACE_CONN_DISC_REPEAT}}, queued...
[1542038344.04][CONN][RXD] >>> Running case #1: 'NETWORKINTERFACE_STATUS'...
[1542038344.09][CONN][INF] found KV pair in stream: {{__testcase_start;NETWORKINTERFACE_STATUS}}, queued...
mbedgt: :82::FAIL: Expected 0 Was -3010
[1542038405.94][CONN][RXD] :82::FAIL: Expected 0 Was -3010
[1542038406.01][CONN][INF] found KV pair in stream: {{__testcase_finish;NETWORKINTERFACE_STATUS;0;1}}, queued...
[1542038406.08][CONN][RXD] >>> 'NETWORKINTERFACE_STATUS': 0 passed, 1 failed with reason 'Assertion Failed'
[1542038406.08][CONN][RXD]
[1542038406.16][CONN][RXD] >>> Test cases: 0 passed, 1 failed with reason 'Assertion Failed'
[1542038406.18][CONN][RXD] >>> TESTS FAILED!
[1542038406.21][CONN][INF] found KV pair in stream: {{__testcase_summary;0;1}}, queued...
[1542038406.22][CONN][INF] found KV pair in stream: {{max_heap_usage;0}}, queued...
[1542038406.24][HTST][ERR] orphan event in main phase: {{max_heap_usage;0}}, timestamp=1542038406.224000
[1542038406.26][CONN][INF] found KV pair in stream: {{reserved_heap;0}}, queued...
[1542038406.27][HTST][ERR] orphan event in main phase: {{reserved_heap;0}}, timestamp=1542038406.256000
[1542038406.27][CONN][INF] found KV pair in stream: {{end;failure}}, queued...
[1542038406.29][CONN][INF] found KV pair in stream: {{__exit;0}}, queued...
[1542038406.29][HTST][INF] __exit(0)
[1542038406.29][HTST][INF] __notify_complete(False)
[1542038406.29][HTST][INF] __exit_event_queue received
[1542038406.29][HTST][INF] test suite run finished after 62.56 sec...
[1542038406.30][CONN][INF] received special event '__host_test_finished' value='True', finishing
[1542038406.35][HTST][INF] CONN exited with code: 0
[1542038406.35][HTST][INF] Some events in queue
[1542038406.35][HTST][INF] stopped consuming events
[1542038406.35][HTST][INF] host test result() call skipped, received: False
[1542038406.35][HTST][INF] calling blocking teardown()
[1542038406.35][HTST][INF] teardown() finished
[1542038406.35][HTST][INF] {{result;failure}}
mbedgt: retry mbedhtrun 1/1
mbedgt: ['mbedhtrun', '-m', u'NUCLEO_F429ZI', '-p', u'COM23:9600', '-f', u'"BUILD/tests/NUCLEO_F429ZI/ARM/TESTS/network/interface/interface.bin"', '-e', u'"TESTS\\host_tests"', '-d', 'E:', '-c', 'default', '-t', u'07960221012567073E61F1AC', '-r', 'default', '-C', '4', '--sync', '5', '-P', '60'] failed after 1 count
mbedgt: checking for GCOV data...
mbedgt: mbed-host-test-runner: stopped and returned 'FAIL'
mbedgt: test on hardware with target id: 07960221012567073E61F1AC
mbedgt: test suite 'tests-network-interface' ......................................................... FAIL in 75.15 sec
test case: 'NETWORKINTERFACE_CONN_DISC_REPEAT' ............................................... SKIPPED in 0.00 sec
test case: 'NETWORKINTERFACE_STATUS' ......................................................... FAIL in 61.92 sec
test case: 'NETWORKINTERFACE_STATUS_GET' ..................................................... SKIPPED in 0.00 sec
test case: 'NETWORKINTERFACE_STATUS_NONBLOCK' ................................................ SKIPPED in 0.00 sec
mbedgt: test case summary: 0 passes, 1 failure
mbedgt: utest test case summary mismatch: utest reported passes and failures miscount!
reported by utest: passes = 0, failures 1)
test case result count: passes = 0, failures 4)
mbedgt: all tests finished!
mbedgt: shuffle seed: 0.7826281322
mbedgt: test suite report:
+-------------------+---------------+-------------------------+--------+--------------------+-------------+
| target | platform_name | test suite | result | elapsed_time (sec) | copy_method |
+-------------------+---------------+-------------------------+--------+--------------------+-------------+
| NUCLEO_F429ZI-ARM | NUCLEO_F429ZI | tests-network-interface | FAIL | 75.15 | default |
+-------------------+---------------+-------------------------+--------+--------------------+-------------+
mbedgt: test suite results: 1 FAIL
mbedgt: test case report:
+-------------------+---------------+-------------------------+-----------------------------------+--------+--------+---------+--------------------+
| target | platform_name | test suite | test case | passed | failed | result | elapsed_time (sec) |
+-------------------+---------------+-------------------------+-----------------------------------+--------+--------+---------+--------------------+
| NUCLEO_F429ZI-ARM | NUCLEO_F429ZI | tests-network-interface | NETWORKINTERFACE_CONN_DISC_REPEAT | 0 | 0 | SKIPPED | 0.0 |
| NUCLEO_F429ZI-ARM | NUCLEO_F429ZI | tests-network-interface | NETWORKINTERFACE_STATUS | 0 | 1 | FAIL | 61.92 |
| NUCLEO_F429ZI-ARM | NUCLEO_F429ZI | tests-network-interface | NETWORKINTERFACE_STATUS_GET | 0 | 0 | SKIPPED | 0.0 |
| NUCLEO_F429ZI-ARM | NUCLEO_F429ZI | tests-network-interface | NETWORKINTERFACE_STATUS_NONBLOCK | 0 | 0 | SKIPPED | 0.0 |
+-------------------+---------------+-------------------------+-----------------------------------+--------+--------+---------+--------------------+
mbedgt: test case results: 1 FAIL / 3 SKIPPED
mbedgt: completed in 76.78 sec
mbedgt: exited with code 1
[mbed-10628] ERROR: "mbedgt" returned error.
Code: 1
Path: "C:\Users\root\i.8715\mbed-os"
Command: "mbedgt --test-spec .\BUILD\tests\NUCLEO_F429ZI\ARM\test_spec.json -n tests-network-interface -V"
Tip: You could retry the last command with "-v" flag for verbose output
---
Stepping back and wondering if this is still an ARM-network thing.
The non-ARM-network testcase that I ran was via 1) Windows network bridging, and 2) phone tethering as an AP.
There's no telling else that test setup might have blocked.
@mikisch81 Do you think you can retest? It _seems_ that the issue has been resolved.
We're waiting for someone to come back from lunch to figure out the why and how.
@cmonr not in the near future as I am no longer in the office..
@mikisch81 That's alright, when you get time.
The leading theory appears to be that something strange happened with DHCP requests, and letting the leases expire caused things to start working again.
Will keep this open for now, but it seems that the problem resolved itself :/
Interesting. Problem has come back.
[1542038344.09][CONN][INF] found KV pair in stream: {{__testcase_start;NETWORKINTERFACE_STATUS}}, queued...
mbedgt: :82::FAIL: Expected 0 Was -3010
[1542038405.94][CONN][RXD] :82::FAIL: Expected 0 Was -3010
NSAPI_ERROR_DHCP_FAILURE = -3010, /*!< DHCP failed to complete successfully */
Closing this since the problem seems to have _frustratingly_ gone away on its own...
Steadily working through the CI backlog.
opening this as @cmonr is seeing this issue on desk as well in CI.
I did a bit more digging into this since I could reproduce it with a K64F on my desk. I found that the DHCP failure is originating from a failure to acquire a semaphore: https://github.com/ARMmbed/mbed-os/blob/9d95d46d6cb409df37675a2dfe35bc43c8d67c83/features/lwipstack/LWIPInterface.cpp#L550-L554
That's why the time between the testcase start and failure is almost exactly 60 seconds, the semaphore acquire is just timing out.
I also managed to acquire wireshark captures (from my host PC, which has the broadcase ARP and DHCP messages) from both failing test cases and passing cases so we can compare them. I can't supply those here (security and all that) but I can supply them to Arm folks through email. @cmonr also has the logs.
Fyi, a service ticket to Austin IT has been sent with high priority, with people CCed. We need to look at what the DHCP server is sending in response to continue.
I also have a adapter which should show up tomorrow to wireshark in between a line, which should also help continue debugging this issue. For reference, it appears that the issue has settled _again_ for the time being.
@ARMmbed/mbed-os-ipcore Any suggestions ?
It is not a failure to acquire a semaphore, but a timeout.
So either the stack does not get DHCP response, or the network eats it, or the driver drops it, or somehow the stack fails to open the mutex.
I have the Wireshark log that looks very much that there is no DHCP response, but the capture file is not complete.
It is not a failure to acquire a semaphore, but a timeout.
Oh yup, good point. definitely different behaviors! I wasn't being very precise.
So either the stack does not get DHCP response, or the network eats it, or the driver drops it, or somehow the stack fails to open the mutex.
My thoughts as well. Seems like the next step is to watch the traffic going from the DHCP server to the device (the offer and acknowledge packets) and see what's going on there. @cmonr has some equipment on order to help with that.
Email has been sent with several captures.
Of note, between tests three and four, the device was power cycled.