Incubator-mxnet: Flaky Tests Tracking Issue

Created on 13 Jan 2018 · 17 Comments · Source: apache/incubator-mxnet

What

Thanks to community members, we identified tests that are flaky and need fixing. I'm putting together this list of open issues to track them, and calling for help in fixing them.

We use this issue for tracking progress and coordinating efforts.

TODO

| Issue | Requester | Category | Cause | Status |
|-------|------------------|----------------|--------------------------------------|--------|
| #7645 | @rahul003 | Operator | numerical stability | Disabled |
| #8211 | @indhub | Autograd | autograd memory footprint | Disabled |
| #8230 | @indhub | Operator | numerical stability | Disabled |
| #8283 | @indhub | Utility | external dependency | Fixed #9503 |
| #8288 | @indhub | Operator | numerical stability (?) | Disabled |
| #8299 | @indhub | Operator | testing through training. randomness | Disabled |
| #8892 | @marcoabreu | Operator | testing through training. randomness | Disabled |
| #8934 | @marcoabreu | Operator | segfault in MKL version | Disabled |
| #9295 | @marcoabreu | Operator | laop hangs in MKL version | Disabled |
| #9384 | @eric-haibin-lin | Sparse/KVStore | segfault for sparse | Disabled |
| #8834 | @marcoabreu | Scala Operator | numerical stability | Disabled |
| #9415 | @sergeykolychev | Perl | segfault in gluon rnn | Disabled |
| #9669 | @KellenSunderland | Python | external dependency | Flaky |
| #9649 | @KellenSunderland | Mem | test timer | Flaky |
| #10087 | @anirudhacharya | Operator | precision | Flaky |

Completed

| Issue | Requester | Category | Cause | Status |
|-------|------------------|----------------|--------------------------------------|--------|
| #9604 | @zhreshold | Python | external dependencies | Fixed #9620 |
| #8928 | @marcoabreu | Perl | CPU segfault | Fixed #9414 |
| #9332 | @KellenSunderland | R | external dependency | Fixed #9598 |
| #9553 | @marcoabreu | Operator | need investigation | Fixed #9581 |

Meaning of status:

  • Flaky: test is enabled and flaky, and is impacting CI.
  • Disabled: temporarily disabled after discovery. Fix is needed.
  • Fixed: fix has finished and test is no longer flaky.
  • @someone: @someone is fixing the test.

How

To add a newly discovered flaky test

Create a new issue for the test, then comment here with a reference to the new issue.

To help fix the tests

Pick an issue that hasn't been taken. Comment here to say which issue you are working on, and I will update the status in the table. Then start working on the issue, and put details, findings, and resolutions in the original issue. The people who wrote the feature and its tests are also a good resource for understanding the issue; we can identify them from the commit history and ping them for help.

The requester of the original issue, as well as @apache/mxnet-committers, should make sure that, as a result of the fix, the tests:

  • Pass reliably with good coverage.
  • Avoid randomness unless necessary.
  • Avoid external dependencies unless necessary (e.g. due to license).
  • Have their root cause found and fixed if it is actually a problem in the code base.
  • Are not resource-intensive unless necessary (e.g. scaling tests).
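
As one illustration of the "reliable" and "avoid randomness" criteria, the sketch below seeds the random number generator for a test, reports the seed when the test fails, and compares floating-point results with an explicit tolerance. The names `with_fixed_seed` and `stable_mean` are hypothetical helpers written for this example, not part of MXNet's test utilities, and only the Python standard library is used.

```python
import functools
import math
import random


def with_fixed_seed(seed=None):
    """Hypothetical decorator: seed `random` before the test and report the seed on failure."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            # Use the given seed, or draw one so that the run can still be replayed.
            used_seed = seed if seed is not None else random.SystemRandom().randrange(2**31)
            random.seed(used_seed)
            try:
                return test_fn(*args, **kwargs)
            except AssertionError as err:
                # Surface the seed so that a failure can be reproduced exactly.
                raise AssertionError(f"{err} (reproduce with seed={used_seed})") from err
        return wrapper
    return decorator


def stable_mean(values):
    """Mean computed with math.fsum to limit floating-point accumulation error."""
    return math.fsum(values) / len(values)


@with_fixed_seed(seed=42)
def test_mean_of_uniform_samples():
    samples = [random.random() for _ in range(10_000)]
    # Compare with an explicit tolerance rather than exact equality.
    assert math.isclose(stable_mean(samples), 0.5, abs_tol=0.02)
```

Logging the seed keeps a randomized test reproducible without giving up input variety; passing a fixed `seed` makes every run identical, which may be preferable while a failure is being investigated.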

Reference

Discussions on dev

Call for Contribution Flaky Test

All 17 comments

I am taking #8928 (Perl issue)

@szha merged https://github.com/apache/incubator-mxnet/pull/9414, which should address two flaky Perl tests and a small change for viz that happened upstream recently.

A few related PRs for reference on randomness problem in CI tests:

  1. https://github.com/apache/incubator-mxnet/pull/8313
  2. https://github.com/apache/incubator-mxnet/pull/8526

Info from @szha: https://pypi.python.org/pypi/flaky

See also: Email thread on dev@ titled: "Improving and rationalizing unit tests" and "Call for Help for Fixing Flaky Tests"

@szha It may help to add the above information to the above list for easy reference.

@bhavinthaker good suggestions. I added these references.

Working on #8283

I fixed #8283 in #9503.

#9581 fixes #9553

#9598 fixes #9332

Two more to be tracked: #9669 and #9649.

test_layer_norm has precision issues - https://github.com/apache/incubator-mxnet/issues/10114

@eric-haibin-lin is there an open issue for test_correlation?

@szha This problem was solved once by #9581, but it popped up again because of my recent changes to the operator to support all float data types. I'll do a deep dive on this issue today and work on a possible fix.

I'm always adding them to https://github.com/apache/incubator-mxnet/projects/9#card-6995282 - do I have to call them out here as well?

I've partially disabled the test test_op_output_names_monitor in this PR https://github.com/apache/incubator-mxnet/pull/10342 as it's causing long hangs in our CI server. Tracked in issue: https://github.com/apache/incubator-mxnet/issues/10341

We now use the GitHub Projects feature for issue tracking: https://github.com/apache/incubator-mxnet/projects/9
