Let's start a discussion here about the roadmap towards 1.6.0. We are looking for:
New features that are useful to your research and development.
Improvements and patches to existing features.
If you have any item that you'd like to propose to have in the roadmap, please do:
Create (or locate an existing) issue/pull request for the item, and note the issue/pull request number.
Comment in this issue with: 1) the above issue number; 2) one sentence describing what the item is about and why it's useful to you.
Indicate whether you'd be willing to help out on the item.
Share the ETA if you're driving the item and have a guesstimate on when it will be done.
Feel free to propose items that weren't part of past roadmap discussions but that you still wish to see in this release.
cc @apache/mxnet-committers
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature
- Can we work towards INT64 support (Large Tensor) as the DEFAULT? This would make large-tensor use cases such as DGL and recommendation-system models easy to run on MXNet; today, they need to compile from source. (CC @apeforest @access2rohit)
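As a quick illustration of the current friction, here is a hedged sketch (not from the issue itself): on builds compiled without large tensor support, which is the current default for pip packages, allocating past the int32 index limit fails.

```python
# Sketch: without USE_INT64_TENSOR_SIZE=1 at build time, tensors beyond the
# int32 index limit (2**31 - 1 elements) cannot be created. On INT64-enabled
# builds this line instead needs ~8.6 GB of RAM for the float32 buffer.
import mxnet as mx

x = mx.nd.zeros((2**31 + 1,))  # raises an error unless built with INT64 support
```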
There are a few regressions that have been introduced since large tensor support was added. For example, a few hierarchical attention networks were training fine up to 1.3.1 but, since the addition of large tensor support, produce NaNs in weight vectors and gradient calculations. We should ensure all regressions related to this feature are fixed.
Related issues and PRs that might be relevant (the above issue is in addition to what is discussed in the links below):
@sandeep-krishnamurthy Agreed on the INT64 enhancements.
We plan to upgrade MKL-DNN to 1.0 in r1.6 so that the MKL-DNN backend can work with the INT64 index type.
CPU-related proposal for 1.6 (to be updated over the next several days). The proposal will be sent out on dev@ soon. @TaoLv
https://github.com/apache/incubator-mxnet/projects/16
- DONE: MKL-DNN subgraph fusion as default @ZhennanQin #15518
- DONE: Quantization API improvement @xinyu-intel #15448
- WIP: RNN (vRNN/LSTM/GRU) with MKL-DNN acceleration @ciyongch @zixuanweeei
- Enable new models, such as NCF @xinyu-intel
Agreed on the INT64 enhancements, too.
For the regression problem, a good solution is to choose the index data type based on the data size at the performance bottleneck, as sketched below.
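A minimal sketch of that idea (a hypothetical helper, not an existing MXNet API): dispatch the index dtype on tensor size so that small tensors keep the faster 32-bit path.

```python
import numpy as np

INT32_MAX = 2**31 - 1

def index_dtype_for(num_elements):
    """Pick the narrowest index dtype that can address the tensor."""
    return np.int32 if num_elements <= INT32_MAX else np.int64
```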
The following features are useful for research and development:
I think making (pre-trained) NNs directly editable may be helpful. For example, say I train a NN whose input is `(data, aux_data)` and whose output is `predict`. As training progresses, I may find that `aux_data` is useless and may introduce additional bias in both training and testing. Fortunately, since L2-regularization pushes all the params related to `aux_data` near 0, I can directly set all the params related to `aux_data` to 0. But the net with these zeroed parameters is slower than a new net (built without `aux_data`) loaded with the pre-trained params. I think it would be useful to enable direct editing of NNs. Something like this would be nicer:
```python
import mxnet as mx

batch_size = 1
data = mx.sym.var('data', shape=(batch_size, 1))
aux_data = mx.sym.var('aux_data', shape=(batch_size, 1))
combine = mx.sym.Concat(data, aux_data, dim=1)
out = mx.sym.FullyConnected(combine, num_hidden=3).softmax()
net = mx.mod.Module(symbol=out, data_names=('data', 'aux_data'))
# ... train, and find that aux_data is useless ...
net.sym               # proposed: returns the names of symbols used in the net
net.pop('aux_data')   # proposed: removes aux_data from the net; the net then no
                      # longer needs `Concat`, and the params of `out` shrink by 3.
```
Since it is very inconvenient to change the params in a ParameterDict (I only know that using `mx.init.Constant` may help, but that is too inconvenient for me), adding a `pop` method to the net may help.
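For reference, a workaround sketch in Gluon under stated assumptions: the `net` below is a hypothetical stand-in for a pre-trained block whose Dense layer consumed the concatenated input, with `aux_data` occupying the last column.

```python
import mxnet as mx
from mxnet.gluon import nn

# Hypothetical stand-in: Dense over concat(data, aux_data), 2 inputs total.
net = nn.Dense(3, in_units=2)
net.initialize()

weight = net.weight      # Parameter with shape (num_hidden, num_inputs)
w = weight.data()
w[:, -1] = 0             # zero the column fed by aux_data
weight.set_data(w)       # write back without going through mx.init.Constant
```

This avoids `mx.init.Constant`, but it still pays the cost the comment above describes: the zeroed net keeps the original graph, so it runs slower than a rebuilt net without `aux_data`.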
What's more, I think MXNet needs a relocatable .dll/.so file. `libmxnet.dll` is too large and takes too much time to load on my Windows 10 machine. I asked how to decrease the DLL size (since there are too many archs in the single .dll file, and all I need is `-arch=sm_61` for my GTX 1060). The reply was to use `nvprune`; I tried it and it gave me an error:

`nvprune fatal : Input file 'libmxnet.dll' not relocatable`

Making `libmxnet.dll`/`libmxnet.so` relocatable would make it possible to further decrease the size of the DLL file, and may also decrease the import time. It seems the symbols are contained in `libmxnet.lib`, but I cannot merge these symbols into `libmxnet.dll`. If someone finds a way to redistribute a `libmxnet.dll` with symbols, importing mxnet in Python may take less time.
It would be nice to add support for default checkpointing in the Estimator API. For reference, TF's Estimator API does provide default checkpointing of models trained with it, which is very useful for users.
Also, can we plan to graduate the Estimator API in MXNet 1.6? This is a super useful API for MXNet users.
@roywei - Any comments?
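For context, a minimal sketch of what checkpointing looks like today with the experimental `gluon.contrib.estimator` API (names such as `Estimator` and `CheckpointHandler` as of that work; exact signatures may differ between versions). The request above is for this behavior to happen by default rather than via an explicit handler.

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.contrib.estimator import Estimator, CheckpointHandler

# Toy model and data so the sketch runs end to end.
net = gluon.nn.Dense(2)
net.initialize(mx.init.Xavier())
loss = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam')

X = mx.nd.random.uniform(shape=(100, 10))
y = mx.nd.array([i % 2 for i in range(100)])
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y), batch_size=20)

est = Estimator(net=net, loss=loss, trainer=trainer)
# Checkpointing must currently be requested explicitly via a handler.
ckpt = CheckpointHandler(model_dir='./checkpoints', model_prefix='model')
est.fit(train_data=train_data, epochs=2, event_handlers=[ckpt])
```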
Thanks to @ChaiBapchya we now have performance comparison data between int32 and int64: https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit#gid=843443107
We have multiple improvements to BERT inference and training speed that we would like to be part of the 1.6 release:
Moving the nightly-failure fixes from the 1.5.1 scope to 1.6.0, as they are failing on the master branch, not the 1.5.x branch.
https://github.com/apache/incubator-mxnet/issues/15613#issuecomment-516937546
Nightly test failures that need to be fixed: #15374
@reminisce @haojin2 Given that numpy operators will be a major topic in the 1.6 release, it would be great if you could add what you intend to include in this release.
If I want to use `mxnet.gluon.nn.Conv2D` to get a depthwise conv layer, I need to explicitly set the `groups` argument. Can this be inferred automatically? For example:
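A minimal sketch of what one has to write today (channel and input sizes are illustrative):

```python
import mxnet as mx
from mxnet.gluon import nn

in_channels = 32
# A depthwise conv currently requires spelling out groups=in_channels by hand.
depthwise = nn.Conv2D(channels=in_channels, kernel_size=3, padding=1,
                      groups=in_channels, in_channels=in_channels)
depthwise.initialize()
y = depthwise(mx.nd.random.uniform(shape=(1, in_channels, 28, 28)))
print(y.shape)  # (1, 32, 28, 28)
```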
As for NumPy compatibility, we would like to add an `ndarray` class that replicates NumPy's `ndarray` in most ways (differences will be documented).
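A short sketch of the interface being described (module names as in the in-progress `mxnet.numpy` work; subject to change before release):

```python
# Enable the NumPy-compatible front end and use familiar ndarray semantics.
from mxnet import np, npx
npx.set_np()

a = np.ones((2, 3))
b = a.sum(axis=0)       # fluent methods mirror numpy.ndarray
c = a + np.arange(3)    # NumPy-style broadcasting
```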
TRT is now working from the CPP package. I think to consider it a released feature we'd want to update the documentation and possibly target the new TRT version (6).
I am working on an interface for multi-threaded inference in MXNet and it would be great if it could go into 1.6.
@anirudh2290 this sounds like a larger change. Would you link to the RFC for it?
@szha Yes, I am planning to add an RFC this week.
For reference, users are confused that Large Tensor Support was enabled in MXNet 1.4 and then disabled again in 1.5. Reference: https://github.com/dmlc/gluon-nlp/issues/981
For 1.6, can we improve the error message to suggest that people compile with Large Tensor Support?
@leezu I am working on it. Here is the WIP PR: https://github.com/apache/incubator-mxnet/pull/16570
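To make the ask concrete, a hypothetical sketch (not the actual PR) of the kind of message being requested: detect int32 index overflow and point users at the build flag.

```python
# Hypothetical helper illustrating the desired error message; the real check
# lives in the C++ backend, and USE_INT64_TENSOR_SIZE is the build flag.
INT32_MAX = 2**31 - 1

def check_size(shape):
    size = 1
    for dim in shape:
        size *= dim
    if size > INT32_MAX:
        raise ValueError(
            "Tensor with %d elements exceeds the default int32 index limit; "
            "rebuild MXNet with USE_INT64_TENSOR_SIZE=1 to enable large "
            "tensor support." % size)
```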
Hi, I was doing some testing on MXNet 1.6.x and 1.5.1 and noticed some performance issues in training; you can find more details here: #16845