Hi,
I have a question about how multi-GPU training works. Is it that each GPU handles a slice of the mini-batch, and after one forward pass the results are merged before the backward pass? What if I have a big network that cannot fit on one GPU? Does MXNet put one part of the network on one GPU and the rest on another and run them serially? How am I supposed to set this up? Thanks a lot.
Your problem can be solved by _model parallelism_; I remember @tqchen has implemented it.
Could you be more specific? I have gone through the documents and I am not sure which part is what I need. The `FeedForward` class in the Python wrapper seems to cover the first case, where a whole batch is divided into mini-batches, not the case where the network itself is divided into several parts. Thanks a lot.
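For reference, what I mean by the first case is just passing a list of contexts to `FeedForward`, something like the sketch below (the toy symbol and the `train_iter` iterator are placeholders I made up):

```python
import mxnet as mx

# toy network; the data-parallel part is only the list of contexts below
data = mx.symbol.Variable("data")
fc = mx.symbol.FullyConnected(data=data, num_hidden=10, name="fc")
net = mx.symbol.SoftmaxOutput(data=fc, name="softmax")

model = mx.model.FeedForward(symbol=net,
                             ctx=[mx.gpu(0), mx.gpu(1)],  # each batch is split across these GPUs
                             num_epoch=1)
# model.fit(X=train_iter)  # train_iter is a placeholder data iterator
```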
I am creating an LSTM example this week; hopefully I can get back to you with more information.
On the other hand, please be aware that MXNet is heavily optimized for memory consumption, so it can likely fit models in one GPU that other frameworks cannot. There is also an experimental option for convnets that can reduce memory by around half; cc @antinucleon for the instructions.
I split the model by using ctx_group in bind. What I did was split each convolution layer and give each piece its own context. For example, say I want to use two GPUs ("dev0", "dev1"). I put half of the filters of each layer on each GPU, i.e. I defined two new layers, each with half of the filters of the original one, and after these two sub-layers I concatenate them so that the next layer sees the full set of feature maps. I split all the conv layers this way. The motivation was to split the computation, but in practice that is not what happened: the communication cost between the two GPUs seems to exceed the benefit of splitting the computation, and the speed is much slower due to the frequent splitting and concatenating. Any advice regarding this? @mli @tqchen
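Roughly, each split layer looked like the sketch below; the kernel size, filter counts, and names are placeholders rather than my actual settings:

```python
import mxnet as mx

data = mx.symbol.Variable("data")

# half of the filters of the layer on each device group
with mx.AttrScope(ctx_group="dev0"):
    conv1_a = mx.symbol.Convolution(data=data, kernel=(3, 3), pad=(1, 1),
                                    num_filter=32, name="conv1_a")
with mx.AttrScope(ctx_group="dev1"):
    conv1_b = mx.symbol.Convolution(data=data, kernel=(3, 3), pad=(1, 1),
                                    num_filter=32, name="conv1_b")

# concatenate along the channel axis so the next layer sees all feature maps
conv1 = mx.symbol.Concat(conv1_a, conv1_b, dim=1, name="conv1_concat")
```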
Hi, @horserma!
I am doing some work on model parallelism in MXNet, like you. I am new to MXNet; could you give me some guidance on how to "split the model by using ctx_group in bind", as you mentioned above?
Thank you very much.
Hi @tuonion,
Basically, I tried it this way. First, declare the variables in different groups, like this:
with mx.AttrScope(ctx_group="dev1"):
    data = mx.symbol.Variable("data")
    conv1 = mx.symbol.Convolution(data=data, kernel=(7, 7), workspace=workspace,
                                  num_filter=64, name="conv1", pad=(3, 3), stride=(2, 2))
So data and conv1 will be assigned to "dev1". Likewise, you can put other layers under "dev2". Make sure every layer is associated with a device label. Then, when you bind the network, you pass group2ctx a dict such as:
net.bind(....,
         group2ctx={'dev1': mx.gpu(0), 'dev2': mx.gpu(1)})
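Putting the two pieces together, a minimal end-to-end sketch could look like this; the toy fully-connected layers, shapes, and group names are only for illustration, and as far as I understand `ctx` is the default device for anything that has no group label:

```python
import mxnet as mx

# assign each part of the network to a named context group
with mx.AttrScope(ctx_group="dev1"):
    data = mx.symbol.Variable("data")
    fc1 = mx.symbol.FullyConnected(data=data, num_hidden=128, name="fc1")
with mx.AttrScope(ctx_group="dev2"):
    fc2 = mx.symbol.FullyConnected(data=fc1, num_hidden=10, name="fc2")
    net = mx.symbol.SoftmaxOutput(data=fc2, name="softmax")

# map the group labels to real devices at bind time
exe = net.simple_bind(ctx=mx.gpu(0),  # default device for unlabeled nodes
                      group2ctx={"dev1": mx.gpu(0), "dev2": mx.gpu(1)},
                      data=(32, 128))
```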
This is basically what I did. In practice I did not find it as useful as expected, because it did not get any faster. My guess is that the frequent communication between devices is the reason, but I am not completely sure. I hope they can provide more information or a tutorial about this.
@horserma, thank you for the very speedy response. I believe your answer will help me a lot, and I also hope the developers can provide more information about the communication between nodes/devices.
By the way, can you share your email address with me? Thanks very much.
@tqchen @antinucleon
> There is also an experimental option for convnets that can reduce memory by around half; cc @antinucleon for the instructions.
A secret experimental option? How would I enable it? I am working on big images and anything that reduces memory requirements is interesting to me.
@vchuravy MXNET_BACKWARD_DO_MIRROR=1, hopefully we will make it official with more explanations soon
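In case it helps others: since it is an environment variable and I have not checked exactly when it is read, the safest way seems to be setting it before `mxnet` is imported (or exporting it in the shell beforehand):

```python
import os

# enable backward mirroring before mxnet is loaded, to be safe
os.environ["MXNET_BACKWARD_DO_MIRROR"] = "1"

import mxnet as mx  # training should now use less memory in the backward pass
```

As I understand it, the saving comes from recomputing some forward activations during the backward pass, so it trades extra computation for memory.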
Hi @tqchen @mli, first of all, thanks for your work.
I have a question about model parallelism:
I see that the LSTM model-parallelism example in MXNet binds each layer to a single GPU. So, would it be easy (or even possible) to implement model parallelism in MXNet the way cuda-convnet2 does it?
Looking forward to your answers, thank you.
This is also relevant for saving memory: https://github.com/dmlc/mxnet-memonger @vchuravy
Closing it for now due to inactivity. Please feel free to reopen it for followup discussions.
@horserma What is your ctx?
net.bind(....,  # ctx= ??
         group2ctx={'dev1': mx.gpu(0), 'dev2': mx.gpu(1)})