Incubator-mxnet: I found a bug in the source code. I don't know how to classify it, but the program runs correctly once I comment the code out.

Created on 29 Dec 2018 · 7 Comments · Source: apache/incubator-mxnet

When I loaded a model exported from Python and ran it from C++, a CUDA error appeared during forward inference in the C++ program. The output looks like this:

    [11:25:55] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
    [11:25:55] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
    Best Result: pomegranate (id=957, accuracy=0.45423976)
    terminate called after throwing an instance of 'dmlc::Error'
    what(): [11:26:02] src/common/../common/cuda_utils.h:296: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid device ordinal

This happens only when running on the GPU.

I commented out lines 295 and 296 of common/cuda_utils.h. The author calls cudaSetDevice in the destructor; I don't fully understand why, but I think that call caused the error I encountered. After commenting those two lines out, the program works fine. The destructor now looks like this:
    ~DeviceStore() {
      // if (restore_)
      //   CUDA_CALL(cudaSetDevice(restore_device_));
    }

Bug CUDA

All 7 comments

@mxnet-label-bot add [Bug, CUDA]
@l1uw3n Thank you for reporting the issue.

My guess is that cudaGetDevice was called after the CUDA driver had started unloading, so restore_device_ was never populated properly. I already have a PR, #13764, that changes the DeviceGuard, and I can fix this problem there.
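
To make that guess concrete, here is a minimal sketch of a guard that only restores the device when the earlier query actually succeeded. It is not the code from PR #13764 and not MXNet's real `DeviceStore`; `DeviceGuardSketch` is a hypothetical name, and the `fake_rt` namespace is a made-up stand-in for the CUDA runtime so that the example compiles without a GPU. Its `GetDevice`/`SetDevice` only mimic how `cudaGetDevice`/`cudaSetDevice` report success or failure.

```cpp
#include <cassert>

// Hypothetical stand-in for the CUDA runtime (NOT real CUDA calls).
namespace fake_rt {
enum Status { kSuccess = 0, kUnloading = 1, kInvalidOrdinal = 2 };
static bool unloading = false;  // simulates the runtime shutting down
static int current_device = 0;
static int device_count = 2;

// Mimics cudaGetDevice: on failure, *dev is left untouched.
inline Status GetDevice(int* dev) {
  if (unloading) return kUnloading;
  *dev = current_device;
  return kSuccess;
}

// Mimics cudaSetDevice: rejects ordinals outside [0, device_count).
inline Status SetDevice(int dev) {
  if (unloading) return kUnloading;
  if (dev < 0 || dev >= device_count) return kInvalidOrdinal;
  current_device = dev;
  return kSuccess;
}
}  // namespace fake_rt

// Assumed design: remember the previous device only if the query succeeded,
// and restore in the destructor only if we really switched.
class DeviceGuardSketch {
 public:
  explicit DeviceGuardSketch(int new_device) {
    have_saved_ = (fake_rt::GetDevice(&saved_device_) == fake_rt::kSuccess);
    if (have_saved_ && saved_device_ != new_device) {
      switched_ = (fake_rt::SetDevice(new_device) == fake_rt::kSuccess);
    }
  }
  ~DeviceGuardSketch() {
    // Never throw from a destructor; a failed restore is ignored here.
    if (will_restore()) fake_rt::SetDevice(saved_device_);
  }
  bool will_restore() const { return have_saved_ && switched_; }

 private:
  int saved_device_ = -1;   // stays -1 if the query failed
  bool have_saved_ = false;
  bool switched_ = false;
};

// Usage: switch to `target` for one scope, then report the current device.
int device_after_scope(int target) {
  {
    DeviceGuardSketch guard(target);
    assert(!guard.will_restore() || fake_rt::current_device == target);
  }
  return fake_rt::current_device;
}
```

With a guard like this, a failed query leaves nothing to restore, so the destructor never passes an unset ordinal to the set-device call. Whether this matches the actual failure in this issue is unconfirmed, since no reproduction was provided.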

@l1uw3n It seems the related PR has been merged. Was your issue resolved? If so, please close this issue.
Please feel free to re-open it if it was closed in error.
Thanks!

@l1uw3n Can you please confirm whether PR #13764 fixes this issue?
If not, can you please provide steps to reproduce it so we can help resolve it?
Thanks!

@l1uw3n It looks like folks are trying to help, but without a follow-up from your end or a reproduction scenario, it is hard to help out.
Please let us know whether this still happens or whether it is fixed by @ptrendx's code change.
Feel free to close this if it no longer reproduces. Thanks!

I think this bug can be closed, given #13764 and the lack of a repro scenario or response from the submitter of the issue (@l1uw3n). If it is closed in error, it can be re-opened.
@anirudh2290 Can you please review and, unless I am missing anything, close it?

Fixed by #13764
