Hi,
I have a question about how multi-GPU training works. Is it that each GPU handles a slice of the mini-batch, and after one forward pass the results are merged before the backward pass? What if I have a big network that cannot fit on one GPU? Does MXNet put one part of the network on one GPU and the rest on another and run them serially? How am I supposed to set this up? Thanks a lot.
Your problem can be solved by _model parallelism_; I remember @tqchen has implemented it.
Could you be more specific? I have gone through the documents and I am not sure which part is what I need. The `FeedForward` class in the Python wrapper seems to cover the first case, where a whole batch is divided into mini-batches, not the case where the network itself is divided into several parts. Thanks a lot.
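For reference, what I mean by the first case is just passing a list of contexts to `FeedForward`, something like the sketch below (the toy symbol and the `train_iter` iterator are placeholders I made up):

```python
import mxnet as mx

# toy network; the data-parallel part is only the list of contexts below
data = mx.symbol.Variable("data")
fc = mx.symbol.FullyConnected(data=data, num_hidden=10, name="fc")
net = mx.symbol.SoftmaxOutput(data=fc, name="softmax")

model = mx.model.FeedForward(symbol=net,
                             ctx=[mx.gpu(0), mx.gpu(1)],  # each batch is split across these GPUs
                             num_epoch=1)
# model.fit(X=train_iter)  # train_iter is a placeholder data iterator
```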
I am creating an LSTM example this week; hopefully I can get back to you with more information.
On the other hand, please be aware that MXNet is heavily optimized for memory consumption, so it can likely fit models in one GPU that other frameworks cannot. There is also an experimental option for convnets that can reduce memory by around half; cc @antinucleon for the instructions.
I split the model by using ctx_group in bind. What I did was split each convolution layer and give each piece its own context. For example, say I want to use two GPUs ("dev0", "dev1"). I put half of the filters of each layer on each GPU, i.e. I defined two new layers, each with half of the filters of the original one, and after these two sub-layers I concatenate them so that the next layer sees the full set of feature maps. I split all the conv layers this way. The motivation was to split the computation, but in practice that is not what happened: the communication cost between the two GPUs seems to exceed the benefit of splitting the computation, and the speed is much slower due to the frequent splitting and concatenating. Any advice regarding this? @mli @tqchen
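Roughly, each split layer looked like the sketch below; the kernel size, filter counts, and names are placeholders rather than my actual settings:

```python
import mxnet as mx

data = mx.symbol.Variable("data")

# half of the filters of the layer on each device group
with mx.AttrScope(ctx_group="dev0"):
    conv1_a = mx.symbol.Convolution(data=data, kernel=(3, 3), pad=(1, 1),
                                    num_filter=32, name="conv1_a")
with mx.AttrScope(ctx_group="dev1"):
    conv1_b = mx.symbol.Convolution(data=data, kernel=(3, 3), pad=(1, 1),
                                    num_filter=32, name="conv1_b")

# concatenate along the channel axis so the next layer sees all feature maps
conv1 = mx.symbol.Concat(conv1_a, conv1_b, dim=1, name="conv1_concat")
```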
Hi, @horserma!
I am doing some work on model parallelism in MXNet, like you. I am new to MXNet; could you give me some guidance on how to "split the model by using ctx_group in bind", as you mentioned above?
Thank you very much.
Hi @tuonion,
Basically, I tried it this way. First, declare the variables in different groups, like this:
with mx.AttrScope(ctx_group="dev1"):
    data = mx.symbol.Variable("data")
    conv1 = mx.symbol.Convolution(data=data, kernel=(7, 7), workspace=workspace,
                                  num_filter=64, name="conv1", pad=(3, 3), stride=(2, 2))
So data and conv1 will be assigned to "dev1". Likewise, you can put other layers under "dev2". Make sure every layer is associated with a device label. Then, when you bind the network, you pass group2ctx a dict such as:
net.bind(....,
         group2ctx={'dev1': mx.gpu(0), 'dev2': mx.gpu(1)})
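Putting the two pieces together, a minimal end-to-end sketch could look like this; the toy fully-connected layers, shapes, and group names are only for illustration, and as far as I understand `ctx` is the default device for anything that has no group label:

```python
import mxnet as mx

# assign each part of the network to a named context group
with mx.AttrScope(ctx_group="dev1"):
    data = mx.symbol.Variable("data")
    fc1 = mx.symbol.FullyConnected(data=data, num_hidden=128, name="fc1")
with mx.AttrScope(ctx_group="dev2"):
    fc2 = mx.symbol.FullyConnected(data=fc1, num_hidden=10, name="fc2")
    net = mx.symbol.SoftmaxOutput(data=fc2, name="softmax")

# map the group labels to real devices at bind time
exe = net.simple_bind(ctx=mx.gpu(0),  # default device for unlabeled nodes
                      group2ctx={"dev1": mx.gpu(0), "dev2": mx.gpu(1)},
                      data=(32, 128))
```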
This is basically what I did. In practice I did not find it as useful as expected, because it did not get any faster. My guess is that the frequent communication between devices is the reason, but I am not completely sure. I hope they can provide more information or a tutorial about this.
@horserma, thank you for the very speedy response. I believe your answer will help me a lot, and I also hope the developers can provide more information about the communication between nodes/devices.
By the way, can you share your email address with me? Thanks very much.
@tqchen @antinucleon
> There is also an experimental option for convnets that can reduce memory by around half; cc @antinucleon for the instructions.
A secret experimental option? How would I enable it? I am working on big images and anything that reduces memory requirements is interesting to me.
@vchuravy MXNET_BACKWARD_DO_MIRROR=1, hopefully we will make it official with more explanations soon
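In case it helps others: since it is an environment variable and I have not checked exactly when it is read, the safest way seems to be setting it before `mxnet` is imported (or exporting it in the shell beforehand):

```python
import os

# enable backward mirroring before mxnet is loaded, to be safe
os.environ["MXNET_BACKWARD_DO_MIRROR"] = "1"

import mxnet as mx  # training should now use less memory in the backward pass
```

As I understand it, the saving comes from recomputing some forward activations during the backward pass, so it trades extra computation for memory.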
Hi @tqchen @mli, first of all, thanks for your work.
I have a question about model parallelism:
I see that the LSTM model-parallelism example in MXNet binds each layer to a single GPU. So, would it be easy (or even possible) to implement model parallelism in MXNet the way cuda-convnet2 does it?
Looking forward to your answers, thank you.
This is also relevant for saving memory: https://github.com/dmlc/mxnet-memonger @vchuravy
Closing it for now due to inactivity. Please feel free to reopen it for followup discussions.
@horserma What is your ctx?
net.bind(....,  # ctx= ??
         group2ctx={'dev1': mx.gpu(0), 'dev2': mx.gpu(1)})