Let's start a discussion here about the roadmap towards 1.6.0. We are looking for:
New features that are useful to your research and development.
Improvements and patches to existing features.
If you have any item that you'd like to propose to have in the roadmap, please do:
Create (or locate an existing) issue/pull request for the item, and note the issue/pull request number.
Comment in this issue with: 1) the above issue number; 2) one sentence describing what the item is about and why it's useful to you.
Indicate whether you'd be willing to help out on the item.
Share the ETA if you're driving the item and have a guesstimate on when it will be done.
Feel free to propose items that weren't part of past roadmap discussions but that you still wish to see in this release.
cc @apache/mxnet-committers
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature
- Can we work towards INT64 support (Large Tensor) as the DEFAULT? This would make large-tensor use cases such as DGL and recommendation-system models easy to run on MXNet; today, they need to compile from source. (CC @apeforest @access2rohit)
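As a quick illustration of the current friction, here is a hedged sketch (not from the issue itself): on builds compiled without large tensor support, which is the current default for pip packages, allocating past the int32 index limit fails.

```python
# Sketch: without USE_INT64_TENSOR_SIZE=1 at build time, tensors beyond the
# int32 index limit (2**31 - 1 elements) cannot be created. On INT64-enabled
# builds this line instead needs ~8.6 GB of RAM for the float32 buffer.
import mxnet as mx

x = mx.nd.zeros((2**31 + 1,))  # raises an error unless built with INT64 support
```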
There are a few regressions that have been introduced since large tensor support was added. For example, a few hierarchical attention networks were training fine up to 1.3.1 but, since the addition of large tensor support, produce NaNs in weight vectors and gradient calculations. We should ensure all regressions related to this feature are fixed.
Related issues and PRs that might be relevant (the above issue is in addition to what is discussed in the links below):
@sandeep-krishnamurthy Agreed on the INT64 enhancements.
We plan to upgrade MKL-DNN to 1.0 in r1.6 so that the MKL-DNN backend can work with the INT64 index type.
CPU-related proposal for 1.6 (to be updated over the next several days). The proposal will be sent out on dev@ soon. @TaoLv
https://github.com/apache/incubator-mxnet/projects/16
- DONE: MKL-DNN subgraph fusion as default @ZhennanQin #15518
- DONE: Quantization API improvement @xinyu-intel #15448
- WIP: RNN (vRNN/LSTM/GRU) with MKL-DNN acceleration @ciyongch @zixuanweeei
- Enable new models, such as NCF @xinyu-intel
Agreed on the INT64 enhancements, too.
For the regression problem, a good solution is to choose the index data type based on the data size at the performance bottleneck, as sketched below.
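A minimal sketch of that idea (a hypothetical helper, not an existing MXNet API): dispatch the index dtype on tensor size so that small tensors keep the faster 32-bit path.

```python
import numpy as np

INT32_MAX = 2**31 - 1

def index_dtype_for(num_elements):
    """Pick the narrowest index dtype that can address the tensor."""
    return np.int32 if num_elements <= INT32_MAX else np.int64
```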
The following features are useful for research and development:
I think making (pre-trained) NNs directly editable may be helpful. For example, say I train a NN whose input is `(data, aux_data)` and whose output is `predict`. As training progresses, I may find that `aux_data` is useless and may introduce additional bias in both training and testing. Fortunately, since L2-regularization pushes all the params related to `aux_data` near 0, I can directly set all the params related to `aux_data` to 0. But the net with these zeroed parameters is slower than a new net (built without `aux_data`) loaded with the pre-trained params. I think it would be useful to enable direct editing of NNs. Something like this would be nicer:
```python
import mxnet as mx

batch_size = 1
data = mx.sym.var('data', shape=(batch_size, 1))
aux_data = mx.sym.var('aux_data', shape=(batch_size, 1))
combine = mx.sym.Concat(data, aux_data, dim=1)
out = mx.sym.FullyConnected(combine, num_hidden=3).softmax()
net = mx.mod.Module(symbol=out, data_names=('data', 'aux_data'))
# ... train, and find that aux_data is useless ...
net.sym               # proposed: returns the names of symbols used in the net
net.pop('aux_data')   # proposed: removes aux_data from the net; the net then no
                      # longer needs `Concat`, and the params of `out` shrink by 3.
```
Since it is very inconvenient to change the params in a ParameterDict (I only know that using `mx.init.Constant` may help, but that is too inconvenient for me), adding a `pop` method to the net may help.
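For reference, a workaround sketch in Gluon under stated assumptions: the `net` below is a hypothetical stand-in for a pre-trained block whose Dense layer consumed the concatenated input, with `aux_data` occupying the last column.

```python
import mxnet as mx
from mxnet.gluon import nn

# Hypothetical stand-in: Dense over concat(data, aux_data), 2 inputs total.
net = nn.Dense(3, in_units=2)
net.initialize()

weight = net.weight      # Parameter with shape (num_hidden, num_inputs)
w = weight.data()
w[:, -1] = 0             # zero the column fed by aux_data
weight.set_data(w)       # write back without going through mx.init.Constant
```

This avoids `mx.init.Constant`, but it still pays the cost the comment above describes: the zeroed net keeps the original graph, so it runs slower than a rebuilt net without `aux_data`.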
What's more, I think MXNet needs a relocatable .dll/.so file. `libmxnet.dll` is too large and takes too much time to load on my Windows 10 machine. I asked how to decrease the DLL size (since there are too many archs in the single .dll file, and all I need is `-arch=sm_61` for my GTX 1060). The reply was to use `nvprune`; I tried it and it gave me an error:

`nvprune fatal : Input file 'libmxnet.dll' not relocatable`

Making `libmxnet.dll`/`libmxnet.so` relocatable would make it possible to further decrease the size of the DLL file, and may also decrease the import time. It seems the symbols are contained in `libmxnet.lib`, but I cannot merge these symbols into `libmxnet.dll`. If someone finds a way to redistribute a `libmxnet.dll` with symbols, importing mxnet in Python may take less time.
It would be nice to add support for default checkpointing in the Estimator API. For reference, TF's Estimator API does provide default checkpointing of models trained with it, which is very useful for users.
Also, can we plan to graduate the Estimator API in MXNet 1.6? This is a super useful API for MXNet users.
@roywei - Any comments?
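For context, a minimal sketch of what checkpointing looks like today with the experimental `gluon.contrib.estimator` API (names such as `Estimator` and `CheckpointHandler` as of that work; exact signatures may differ between versions). The request above is for this behavior to happen by default rather than via an explicit handler.

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.contrib.estimator import Estimator, CheckpointHandler

# Toy model and data so the sketch runs end to end.
net = gluon.nn.Dense(2)
net.initialize(mx.init.Xavier())
loss = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam')

X = mx.nd.random.uniform(shape=(100, 10))
y = mx.nd.array([i % 2 for i in range(100)])
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y), batch_size=20)

est = Estimator(net=net, loss=loss, trainer=trainer)
# Checkpointing must currently be requested explicitly via a handler.
ckpt = CheckpointHandler(model_dir='./checkpoints', model_prefix='model')
est.fit(train_data=train_data, epochs=2, event_handlers=[ckpt])
```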
Thanks to @ChaiBapchya we now have performance comparison data between int32 and int64: https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit#gid=843443107
We have multiple improvements to BERT inference and training speed that we would like to be part of the 1.6 release:
Moving the nightly-failure fixes from the 1.5.1 scope to 1.6.0, as they are failing on the master branch, not the 1.5.x branch.
https://github.com/apache/incubator-mxnet/issues/15613#issuecomment-516937546
Nightly test failures that need to be fixed: #15374
@reminisce @haojin2 Given that numpy operators will be a major topic in the 1.6 release, it would be great if you could add what you intend to include in this release.
If I want to use `mxnet.gluon.nn.Conv2D` to get a depthwise conv layer, I need to explicitly set the `groups` argument. Can this be inferred automatically? For example:
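A minimal sketch of what one has to write today (channel and input sizes are illustrative):

```python
import mxnet as mx
from mxnet.gluon import nn

in_channels = 32
# A depthwise conv currently requires spelling out groups=in_channels by hand.
depthwise = nn.Conv2D(channels=in_channels, kernel_size=3, padding=1,
                      groups=in_channels, in_channels=in_channels)
depthwise.initialize()
y = depthwise(mx.nd.random.uniform(shape=(1, in_channels, 28, 28)))
print(y.shape)  # (1, 32, 28, 28)
```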
As for NumPy compatibility, we would like to add an `ndarray` class that replicates NumPy's `ndarray` in most ways (differences will be documented).
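A short sketch of the interface being described (module names as in the in-progress `mxnet.numpy` work; subject to change before release):

```python
# Enable the NumPy-compatible front end and use familiar ndarray semantics.
from mxnet import np, npx
npx.set_np()

a = np.ones((2, 3))
b = a.sum(axis=0)       # fluent methods mirror numpy.ndarray
c = a + np.arange(3)    # NumPy-style broadcasting
```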
TRT is now working from the CPP package. I think to consider it a released feature we'd want to update the documentation and possibly target the new TRT version (6).
I am working on an interface for multi-threaded inference in MXNet and it would be great if it could go into 1.6.
@anirudh2290 this sounds like a larger change. Would you link to the RFC for it?
@szha Yes, I am planning to add an RFC this week.
For reference, users are confused that Large Tensor Support was enabled in MXNet 1.4 and then disabled again in 1.5. Reference: https://github.com/dmlc/gluon-nlp/issues/981
For 1.6, can we improve the error message to suggest that people compile with Large Tensor Support?
@leezu I am working on it. Here is the WIP PR: https://github.com/apache/incubator-mxnet/pull/16570
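To make the ask concrete, a hypothetical sketch (not the actual PR) of the kind of message being requested: detect int32 index overflow and point users at the build flag.

```python
# Hypothetical helper illustrating the desired error message; the real check
# lives in the C++ backend, and USE_INT64_TENSOR_SIZE is the build flag.
INT32_MAX = 2**31 - 1

def check_size(shape):
    size = 1
    for dim in shape:
        size *= dim
    if size > INT32_MAX:
        raise ValueError(
            "Tensor with %d elements exceeds the default int32 index limit; "
            "rebuild MXNet with USE_INT64_TENSOR_SIZE=1 to enable large "
            "tensor support." % size)
```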
Hi, I was doing some testing on MXNet 1.6.x and 1.5.1 and noticed some performance issues in training; you can find more details here: #16845