Thanks to community members, we identified tests that are flaky and need fixing. I'm putting together the list of open issues for tracking them and calling for help on fixing them.
We use this issue for tracking progress and coordinating efforts.
| Issue | Requester | Category | Cause | Status |
|-------|------------------|----------------|--------------------------------------|--------|
| #7645 | @rahul003 | Operator | numerical stability | Disabled |
| #8211 | @indhub | Autograd | autograd memory footprint | Disabled |
| #8230 | @indhub | Operator | numerical stability | Disabled |
| #8283 | @indhub | Utility | external dependency | Fixed #9503 |
| #8288 | @indhub | Operator | numerical stability (?) | Disabled |
| #8299 | @indhub | Operator | testing through training. randomness | Disabled |
| #8892 | @marcoabreu | Operator | testing through training. randomness | Disabled |
| #8934 | @marcoabreu | Operator | segfault in MKL version | Disabled |
| #9295 | @marcoabreu | Operator | laop hangs in MKL version | Disabled |
| #9384 | @eric-haibin-lin | Sparse/KVStore | segfault for sparse | Disabled |
| #8834 | @marcoabreu | Scala Operator | numerical stability | Disabled |
| #9415 | @sergeykolychev | Perl | segfault in gluon rnn | Disabled |
| #9669 | @KellenSunderland | Python | external dependency | Flaky |
| #9649 | @KellenSunderland | Mem | test timer | Flaky |
| #10087 | @anirudhacharya | Operator | precision | Flaky |

| Issue | Requester | Category | Cause | Status |
|-------|------------------|----------------|--------------------------------------|--------|
| #9604 | @zhreshold | Python | external dependencies | Fixed #9620 |
| #8928 | @marcoabreu | Perl | CPU segfault | Fixed #9414 |
| #9332 | @KellenSunderland | R | external dependency | Fixed #9598 |
| #9553 | @marcoabreu | Operator | need investigation | Fixed #9581 |
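Meaning of status:
- Disabled: temporarily disabled after discovery. Fix is needed.
- Flaky: test is enabled with retries. Fix is still needed.
- Fixed: fix has finished and test is no longer flaky.

To add a new flaky test that was discovered
Create a new issue for the test, then comment here with a reference to the new issue.

To help fix the tests
Pick an issue that hasn't been taken. Comment here saying which issue you are working on, and I will update the status in the table. Then start working on the issue, and put details, findings, and resolutions in the original issue. The people who wrote the feature and its tests are also a good resource for understanding the issue; we can identify them from the commit history and ping them for help.

The requester of the original issue, as well as @apache/mxnet-committers, should make sure that, as a result of the fix:
- The tests pass reliably
- The tests avoid randomness if possible
- The tests avoid external dependencies if possible
- The root cause is found and fixed if it is actually a problem in the code base
- The tests are not resource-intensive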
Call for Help for Fixing Flaky Tests
I am taking #8928 (Perl issue)
On Sat, Jan 13, 2018 at 1:44 PM, Sheng Zha notifications@github.com wrote:
What
Thanks to community members, we identified tests that are flaky and need fixing. I'm putting together the list of open issues for tracking them and calling for help on fixing them.
We use this issue for tracking progress and coordinating efforts.

| Issue | Requester | Category | Cause | Status |
|-------|------------------|----------------|--------------------------------------|--------|
| #7645 | @rahul003 | Operator | numerical stability | Disabled |
| #8211 | @indhub | Autograd | autograd memory footprint | Disabled |
| #8230 | @indhub | Operator | numerical stability | Disabled |
| #8283 | @indhub | Utility | external dependency | Disabled |
| #8288 | @indhub | Operator | numerical stability (?) | Disabled |
| #8299 | @indhub | Operator | testing through training. randomness | Disabled |
| #8892 | @marcoabreu | Operator | testing through training. randomness | Disabled |
| #8934 | @marcoabreu | Operator | segfault in MKL version | Disabled |
| #9295 | @marcoabreu | Operator | laop hangs in MKL version | Disabled |
| #9384 | @eric-haibin-lin | Sparse/KVStore | segfault for sparse | Disabled |
| #8834 | @marcoabreu | Scala Operator | numerical stability | Disabled |
| #8928 | @marcoabreu | Perl | CPU segfault | Disabled |
| #9332 | @KellenSunderland | R | external dependency | Disabled |

Meaning of status:
- Disabled: temporarily disabled after discovery. Fix is needed.
- Flaky: test is enabled with retries. Fix is still needed.
- Fixed: fix has finished and test is no longer flaky.
To add new flaky test that was discovered
Create a new issue for the test, and comment here and refer the new issue.

To help fixing the tests
Pick an issue that hasn't been taken. Comment here that you are working on which issue, and I will update the status in the table. Then start working on the issue, and put details, findings and resolutions in the original issue.

Requester of the original issue, as well as @apache/mxnet-committers should make sure that as a result of the fix, the tests are:
- Reliably passing
- Avoid randomness if possible
- Avoid external dependency if possible
- Root-cause is found and fixed if it's actually a problem in code base.
- Not resource-intensive
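
One common way to approach the "avoid randomness" goal is to seed the random number generator explicitly and report the seed when a test fails, so the failure can be reproduced. The sketch below assumes a test that draws from Python's built-in `random` module; `with_fixed_seed` and `sample_values` are hypothetical names, and a real MXNet test would presumably also need to seed NumPy and the framework's own generator.

```python
import functools
import random


def with_fixed_seed(seed=None):
    """Seed Python's global RNG before the test runs (hypothetical helper)."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            # If no seed was given, pick one and remember it so a failing
            # run can still be reproduced from the printed value.
            used_seed = seed if seed is not None else random.randrange(2**31)
            random.seed(used_seed)
            try:
                return test_fn(*args, **kwargs)
            except Exception:
                print(f"{test_fn.__name__} failed with seed {used_seed}")
                raise
        return wrapper
    return decorator


@with_fixed_seed(seed=42)
def sample_values():
    # With a fixed seed, every call draws the same three numbers.
    return [random.random() for _ in range(3)]
```

Calling `sample_values()` twice returns the same list, because the generator is reseeded on every call.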
@szha merged https://github.com/apache/incubator-mxnet/pull/9414 that should address two flaky Perl tests and a small change for viz that happened upstream recently.
A few related PRs for reference on randomness problem in CI tests:
Info from @szha: https://pypi.python.org/pypi/flaky
See also: Email thread on dev@ titled: "Improving and rationalizing unit tests" and "Call for Help for Fixing Flaky Tests"
@szha It may help to add the above information to the above list for easy reference.
@bhavinthaker good suggestions. I added these references.
Working on #8283
I fixed #8283 in #9503.
#9581 fixes #9553
Two more to be tracked: #9669 and #9649.
One more issue to be tracked - https://github.com/apache/incubator-mxnet/issues/10087
test_layer_norm has precision issues - https://github.com/apache/incubator-mxnet/issues/10114
@eric-haibin-lin is there an open issue for test_correlation?
@szha This problem was solved once by #9581, but it popped up again due to my recent changes to the operator to support all float data types. I'll do a deep dive on this issue today and work on a possible fix.
I'm always adding them to https://github.com/apache/incubator-mxnet/projects/9#card-6995282 - do I have to call them out here as well?
I've partially disabled the test test_op_output_names_monitor in this PR https://github.com/apache/incubator-mxnet/pull/10342 as it's causing long hangs in our CI server. Tracked in issue: https://github.com/apache/incubator-mxnet/issues/10341
We now use the GitHub project functionality for issue tracking: https://github.com/apache/incubator-mxnet/projects/9