Incubator-mxnet: v1.0 Stable Release TODO List

Created on 6 Aug 2016  ·  32 comments  ·  Source: apache/incubator-mxnet

It's about time for a feature-complete stable release.

We are in the process of a major refactor. While most changes are on the backend side and therefore should not significantly affect users, we do expect to break a few small things, and possibly compatibility with other language bindings.
So authors of the Julia, R, Scala, etc. packages, please stay tuned and adopt the new API. It should be a quick fix, and we will have a guide for the transition.
@thirdwing @pluskid @vchuravy @Ldpe2G

Transition Guide/List of Breaking Changes:

Developer

  1. TBlob and TShape have moved from the mshadow namespace to the mxnet namespace. Fix: change mshadow::TBlob and mshadow::TShape to TBlob and TShape in your code.
  2. Please do not use cudaMalloc & cudaFree directly anywhere in MXNet. Use Storage::Get()->Alloc(size, Context::GPU()) instead to allocate memory on the current GPU.

User

If you were training networks with a BatchNorm layer on CPU, or on GPU with cuDNN v4 or below, before Jul 5th, you may find your model outputting totally wrong results after loading it back for testing. The simplest fix is to load your .params files with ndarray.load, set all arrays whose key ends with '_gamma' to 1.0, and save them back.
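The gamma-reset fix above can be sketched as follows. This is a minimal illustration using numpy arrays and hypothetical parameter names in place of a real checkpoint; in practice you would load the dict with mx.nd.load on your .params file and write it back with mx.nd.save.

```python
import numpy as np

# Minimal sketch of the gamma-reset fix. The parameter names below are
# hypothetical; in real use, load the dict with mx.nd.load('model.params'),
# apply the same loop, and write it back with mx.nd.save.
params = {
    'bn1_gamma': np.array([0.3, 0.7]),  # stale gamma values from an old run
    'bn1_beta':  np.array([0.1, 0.2]),  # other arrays are left untouched
}
for name, arr in params.items():
    if name.endswith('_gamma'):
        arr[:] = 1.0  # reset every gamma array in place
```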

  1. If you load a model trained before Dec 2015 for prediction and the model uses BatchNorm, it may output totally wrong results. This can be fixed by adding fix_gamma=True to all BatchNorm layers in your symbol construction script, or adding 'fix_gamma': 'True' to all BatchNorm layers in your .json model file.
  2. sum_axis, max_axis, and min_axis are removed. Please use mx.nd.sum(src, axis=n), mx.nd.max(src, axis=n), and mx.nd.min(src, axis=n) instead.
  3. element_mask is removed. Please use src*mask.reshape((mask.size, 1, 1, ..., 1)) directly, as binary ops now support broadcasting.
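As a sketch of items 2 and 3, the replacements follow numpy-style semantics, which mx.nd's axis reductions and broadcasting binary ops mirror; the shapes below are made up for illustration:

```python
import numpy as np

# src: (batch, channel, h, w); mask: (batch,) of 0/1 entries.
src = np.ones((2, 3, 4, 4))
mask = np.array([1.0, 0.0])

# Replacement for element_mask: reshape mask to (batch, 1, 1, 1) so the
# multiply broadcasts over the trailing axes.
out = src * mask.reshape((mask.size, 1, 1, 1))

# Replacement for the removed *_axis ops: a plain axis reduction,
# e.g. mx.nd.max(src, axis=n) — here shown with numpy.
m = src.max(axis=1)  # reduces the channel axis, leaving shape (2, 4, 4)
```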

TODOs

  1. [ ] Refactor Symbolic graph to use NNVM. @tqchen

    1. [x] Finish NNVM and Passes.

    2. [ ] Refactor NDArray interface to use nnvm::op and add a Cython version. @piiswrong

    3. [ ] Set ndarray function naming convention straight.

    4. [x] Refactor Executor to use NNVM

  2. [x] Bring in NCCL @mli

    1. [ ] Use NCCL reduce and broadcast and fix the deadlock bug. NCCL is problematic with our engine.

    2. [x] Or, write our own ring-based P2P reduce.

  3. [ ] Better Tests @mli @piiswrong

    1. [ ] Set up an EC2 test server.

    2. [ ] Set up GPU+CPU consistency and gradient checks.

    3. [ ] Run performance regression tests.

    4. [ ] Test & debug the C++ prediction interface.

  4. [ ] Sparse @mli @antinucleon
  5. [ ] Better Doc @leopd

    1. [ ] Improve doc formatting and readability

    2. [ ] Fix confusing language and description.

    3. [ ] More tutorials.

    4. [ ] Reorganize docs. Put pages where they belong.

    5. [ ] Improve the installation guide. Consider adding a script similar to the Torch installation script.

  6. [ ] Misc

    1. [x] Refactor ccoptimizer interface to make writing new ccoptimizers easier. Add ccadam.

    2. [ ] Fix memory allocation policy. Explain in the docs that you shouldn't use cudaMalloc in operators; use the requested temp space for temporary memory, or the pooled allocator for holding states.

    3. [ ] ...

  7. [ ] Fix known bugs

    1. [ ] Fix the CustomOp bug that causes a cyclic dependency: https://github.com/dmlc/mxnet/issues/2945

  8. [ ] IO doc and refactor

    1. [x] Move the OpenCV plugin into the main repo and use the new ndarray interface.

    2. [ ] Update IIterator to support multiple data/label.

    3. [x] Front-end based IO with more features like indexing and shuffling

Roadmap

  1. [ ] High-level flexible RNN interface

    1. [ ] one2one, one2many, seq2seq

    2. [ ] speech example

    3. [ ] lm example

    4. [ ] distributed data/model parallel benchmark

    5. [ ] attention

    6. [ ] memory/NTM

    7. [ ] better CTC support

All 32 comments

I would propose Float16 support as an additional target.


For the optimizer part, @tqchen and I are thinking about supporting throwing the optimizer into the computation graph, so fewer ccXX optimizers will be needed.

Until we have RTC, that doesn't help much. You still need at least a 2x buffer.

We may consider building the docs on EC2, then syncing back to Read the Docs, because the doc build currently fails due to a compile timeout.

Yes. Or maybe just host from EC2.

great!!
@piiswrong what does nnvm mean?

@vchuravy we may need to put more effort into int8 rather than fp16. From current info, int8 will be mainstream in the future.

@antinucleon Great to hear; the work @Godricly and I have been doing is focused purely on making our operators support arbitrary DTypes. That should help the int8 work as well?

(This is off topic, but I would expect fixed-point with int8 rather than truly int8?)

@vchuravy It is still being investigated by @winstywang. If we use int8 directly, there is no performance gain. But the official documentation mentions that for the new Titan X, int8 performance is 44T, almost 4 times that of fp32.

@vchuravy NV should have specific instructions for int8; currently, using int8 directly only brings a 25% performance gain according to our tests.

My suggestion as follows:

  • Documentation (most important)
  • Some kind of graph creation debugging tool

it would be nice if we could have a GUI for this; it's painful to debug the graph

  • Dynamic execution capability for operators (for example, stochastic depth and fractal networks)
  • CustomOp is not DType compatible yet
  • A simple debugging operator (just printing output and gradient, so you can insert it anywhere; add a switch to decide what to print)
  • Check if ps-lite is compatible with DType

Stochastic depth can be done with bucketing.
We have the monitor for debugging.

with NNVM we may enable fully dynamic execution.

@piiswrong @leopd We need to move the doc-building system to EC2. The Read the Docs build keeps failing because it runs out of time.

@antinucleon Is there any paper available right now for uint8 NNs? And what does NNVM stand for? I'm having a hard time searching for it.

Here are some thoughts about the docs:

  • [ ] A summary page of all the examples
  • [ ] A summary page of new features recently added. Each time a new feature is added, a simple explanation and sample code must be provided.
  • We CANNOT say "you can just use XXX" to users when there is no doc or simple example for XXX. Each time we mention something, a doc or example must be provided.
  • [ ] A step-by-step tutorial to teach beginners how to implement some basic operations in NNs, such as finetuning, extracting features, etc. These could cover more than 80% of usage.
  • [ ] Finish the CS231n homework and projects with Minpy and MXNet

@piiswrong @antinucleon

Another thing I'd like to ask for is a refactor of LSTM, if possible.
Can we hide provide_data and provide_label in an elegant way? I understand that the current approach works pretty well, but exposing the internal stuff may bring some trouble (like the extra provided_data_type for me in the fp16 LSTM, #2564).

I would vote for another issue which is very important for users:

  • [ ] Make sure the speed and accuracy in all test cases are the same as or better than Caffe.
  • Currently we have all kinds of performance issues: CPU slower than Caffe, small batches slower than Caffe, and ResNet on ImageNet worse than Caffe.

The ResNet issue is caused by IO. Min has reproduced the exact result by using Torch's IO.
The problem is who will do that.

I hope that for each of the issues raised, people can show up and assign, or self-assign, the issue, so we are moving forward effectively.

It's good to have a single page containing everything, but I totally agree that we can open an issue for each point and cite the links here.

@mli Yes. If someone wants to talk more about/start working on a task, feel free to open a new issue and link it here. Also assign it to milestone v1.0

Also, we may consider treating warnings as errors in the future.

I'll list a roadmap for scala pkg this weekend.

@antinucleon Can I know what's wrong with IO that causes the performance drop?

For docs, I think querying our GitHub issues with the keyword "how to" is a good source for a list of topics to potentially cover.

@piiswrong What does NNVM stand for?

@antinucleon, @jennyzhang0215 and I have implemented MemN2N and NTM and replicated the results in the papers; we may release the code after AAAI or WWW. I can send you the code if you need it now.

Is it ok to do some code optimization in NNVM? https://github.com/dmlc/mxnet/issues/3105

Thanks to all of DMLC for this great effort.
