Hello,
I have two datasets with different kinds of labels, so I have two different networks, but I want to share some layers (the first convolutional ones) between them during training.
Example:
Input 1 (dataset 1) -> | VGG16 convolutional layers, shared | -> | VGG16 layers 1, unshared | -> output 1
Input 2 (dataset 2) -> | VGG16 convolutional layers, shared | -> | VGG16 layers 2, unshared | -> output 2
For the training step, I was planning to train the shared layers batch by batch, alternating between dataset 1 and dataset 2.
Something like this:
# one iteration
net1.forward(batch1.next())
net1.backward()
net1.update()
net2.forward(batch2.next())
net2.backward()
net2.update()
How can I share the weights between two networks? (Not just one network, because I have two different datasets.)
@Shiro-mx you can refer to #6909
@Ldpe2G thank you for your help
I may be wrong, but it seems there is only one output instead of one output per input. Can the feedforward function handle multiple outputs? My goal is to train the shared weights of the different networks, but also to train the unshared weights (the last 2 or 3 layers).
@Shiro-mx I suggest you use the module API, which is more flexible. The triplet loss example shows how to share weights between nets by defining the weight variables with specific names and using them in both nets.
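For illustration, a minimal sketch of that idea (the variable and layer names below are just examples, not taken from the triplet loss example itself): a single weight Variable is created once and passed to two FullyConnected layers, so both symbols refer to the same named parameter.

import mxnet as mx

# One set of named parameter Variables, reused by both branches.
shared_w = mx.sym.Variable('shared_fc_weight')
shared_b = mx.sym.Variable('shared_fc_bias')

# Two branches fed by different data inputs, but built from the same weights.
branch_a = mx.sym.FullyConnected(data=mx.sym.Variable('data_a'),
                                 weight=shared_w, bias=shared_b, num_hidden=64)
branch_b = mx.sym.FullyConnected(data=mx.sym.Variable('data_b'),
                                 weight=shared_w, bias=shared_b, num_hidden=64)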
I created a small network, but I have a problem with binding. I also want to use the SGD optimizer, but I don't know where to put it. I read some topics but I still don't really understand what I have to do.
import mxnet as mx

def get_shared_network(data, fc1_weight, fc1_bias, fc2_weight, fc2_bias):
    fc1 = mx.symbol.FullyConnected(data=data, name='fc1', num_hidden=128, weight=fc1_weight, bias=fc1_bias)
    act1 = mx.symbol.Activation(data=fc1, name='relu1', act_type="relu")
    fc2 = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=64, weight=fc2_weight, bias=fc2_bias)
    act2 = mx.symbol.Activation(data=fc2, name='relu2', act_type="relu")
    return act2

def get_two_network():
    data1 = mx.sym.Variable('data1')
    data2 = mx.symbol.Variable('data2')
    fc1_w = mx.sym.Variable('fc1_w', init=mx.init.Constant(0.02))
    fc2_w = mx.symbol.Variable('fc2_w', init=mx.init.Constant(0.01))
    fc1_b = mx.sym.Variable('fc1_b', init=mx.init.Constant(0.08))
    fc2_b = mx.symbol.Variable('fc2_b', init=mx.init.Constant(0.09))
    act2_1 = get_shared_network(data1, fc1_w, fc1_b, fc2_w, fc2_b)
    act2_2 = get_shared_network(data2, fc1_w, fc1_b, fc2_w, fc2_b)
    fc3_1 = mx.symbol.FullyConnected(data=act2_1, name='fc3_1', num_hidden=5)
    fc3_2 = mx.symbol.FullyConnected(data=act2_2, name='fc3_2', num_hidden=5)
    softmax1 = mx.sym.SoftmaxOutput(data=fc3_1, name='softmax1')
    softmax2 = mx.sym.SoftmaxOutput(data=fc3_2, name='softmax2')
    return [softmax1, softmax2]
# Data
imgrec_train1 = 'mnistasjpg/train_data1' + '.rec'
imglist_train1 = 'mnistasjpg/train_data1' + '.lst'
imgrec_test1 = 'mnistasjpg/test_data1' + '.rec'
imglist_test1 = 'mnistasjpg/test_data1' + '.lst'
imgrec_train2 = 'mnistasjpg/train_data2' + '.rec'
imglist_train2 = 'mnistasjpg/train_data2' + '.lst'
imgrec_test2 = 'mnistasjpg/test_data2' + '.rec'
imglist_test2 = 'mnistasjpg/test_data2' + '.lst'
# # load dataset
train_dataiter1 = mx.io.ImageRecordIter(
    path_imgrec=imgrec_train1,
    data_shape=(3, x, y),
    batch_size=batch_size,
    path_imglist=imglist_train1,
    preprocess_threads=1,
    label_width=1,
    data_name='data1',
    label_name='softmax1_label',
    # shuffle=True,
    # shuffle_chunk_seed=100,
    # seed=100,
    # rand_mirror=True,
    # rand_mirror_prob=0.5,
    # random_crop=True
)
print('testing Dataset ...')
# Data validation
test_dataiter1 = mx.io.ImageRecordIter(
    path_imgrec=imgrec_test1,
    data_shape=(3, x, y),
    batch_size=batch_size,
    path_imglist=imglist_test1,
    preprocess_threads=1,
    label_width=1,
    data_name='data1',
    label_name='softmax1_label'
)
train_dataiter2 = mx.io.ImageRecordIter(
    path_imgrec=imgrec_train2,
    data_shape=(3, x, y),
    batch_size=batch_size,
    path_imglist=imglist_train2,
    preprocess_threads=1,
    label_width=1,
    data_name='data2',
    label_name='softmax2_label',
    # shuffle=True,
    # shuffle_chunk_seed=100,
    # seed=100,
    # rand_mirror=True,
    # rand_mirror_prob=0.5,
    # random_crop=True
)
print('testing Dataset ...')
# Data validation
test_dataiter2 = mx.io.ImageRecordIter(
    path_imgrec=imgrec_test2,
    data_shape=(3, x, y),
    batch_size=batch_size,
    path_imglist=imglist_test2,
    preprocess_threads=1,
    label_width=1,
    data_name='data2',
    label_name='softmax2_label'
)
sgd = mx.optimizer.create('sgd', learning_rate=.001, momentum=0.9, wd=0.0005)
name_model = get_two_network()
# load network
mod1 = mx.mod.Module(name_model[0], context=mx.cpu(), data_names=['data1'], label_names=[output_name])
mod1.bind(data_shapes=[('data1', (1, 3, x, y))], label_shapes=[(output_name, (1,))] )
mod1.init_params()
mod2 = mx.mod.Module(name_model[1], context=mx.cpu(), data_names=['data2'], label_names=[output_name2])
mod2.bind(data_shapes=[('data2', (1, 3, x, y))], label_shapes=[(output_name2, (1,))] )
mod2.init_params()
# train
batch = train_dataiter1.next()
mod1.forward(batch, is_train= True)
result = mod1.get_outputs()[0]
print(result.asnumpy())
#mod1.backward(sgd)
mod1.update()
The error:
Traceback (most recent call last):
File "mxnet_shared.py", line 164, in <module>
mod1.update()
File "/usr/local/lib/python3.5/dist-packages/mxnet-0.10.1-py3.5.egg/mxnet/module/module.py", line 610, in update
assert self.binded and self.params_initialized and self.optimizer_initialized
AssertionError
EDIT: I may have solved the problem by adding this:
mod1.init_optimizer(kvstore='local', optimizer='sgd', optimizer_params=(('learning_rate', 0.001),('momentum', 0.9),('wd', 0.0005) ))
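For reference, a minimal sketch of the intended call order (assuming the same mod1, train_dataiter1, x, y, and the label name 'softmax1_label' from the iterator above):

# Sketch of the sequence that avoids the AssertionError:
# bind -> init_params -> init_optimizer -> forward/backward/update.
mod1.bind(data_shapes=[('data1', (1, 3, x, y))],
          label_shapes=[('softmax1_label', (1,))])
mod1.init_params()
mod1.init_optimizer(kvstore='local', optimizer='sgd',
                    optimizer_params=(('learning_rate', 0.001),
                                      ('momentum', 0.9), ('wd', 0.0005)))

batch = train_dataiter1.next()
mod1.forward(batch, is_train=True)
mod1.backward()   # compute gradients before updating
mod1.update()     # apply one SGD step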
Adding init_optimizer solved the binding error, but it seems that the weights are not shared.
print('AFTER TRAIN')
print(mod1.get_params()[0]['fc2_b'].asnumpy())
print(mod2.get_params()[0]['fc2_b'].asnumpy())
It gives me:
BEFORE TRAIN
[ 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093]
[ 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093]
[[ 9.96710181e-01 4.97425392e-15 6.05048997e-11 3.28975217e-03
9.74552383e-12]]
[[ 1. 0. 0. 0. 0.]]
[[ 1. 0. 0. 0. 0.]]
AFTER TRAIN
[ 0.00929993 0.00930007 0.00930124 0.00930121 0.0092996 0.00930087
0.00930096 0.00930125 0.00929955 0.0093001 0.00930067 0.00929943
0.00929947 0.00930002 0.00929987 0.00929933 0.00930128 0.00929914
0.00929966 0.00930015 0.00930036 0.00929985 0.00929992 0.00929972
0.00930061 0.00929911 0.00929995 0.00930028 0.00929933 0.0092998
0.00929865 0.00929941 0.00929879 0.00930045 0.00930095 0.00930034
0.00929994 0.00929952 0.00930084 0.00930035 0.00930017 0.0093004
0.00930039 0.00930087 0.00929925 0.00929921 0.00930016 0.00930016
0.00929892 0.00930011 0.00930025 0.00930009 0.00929855 0.00930055
0.00930054 0.00930065 0.00930017 0.00929909 0.00929966 0.0093
0.00930044 0.00929949 0.00930084 0.00930004]
[ 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093 0.0093
0.0093]
Why didn't the weight sharing work?
Remove the asnumpy() and you might get its object id, which can help you debug. It should be consistent if your code is correct.
print('AFTER TRAIN')
print(mod1.get_params()[0]['fc2_b'])
print(mod2.get_params()[0]['fc2_b'])
@zihaolucky Thank you for the help.
How can I obtain the id? This is what I get after removing asnumpy():
AFTER TRAIN
<NDArray 64 @cpu(0)>
<NDArray 64 @cpu(0)>
Oops, my bad. Could you try comparing the two objects with the Python operator is?
BTW, did you just use two different mods? I suppose these modules are different, although they have the same names for the weights.
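For example, a quick identity check along those lines (a sketch, assuming the mod1 and mod2 from the code above):

b1 = mod1.get_params()[0]['fc2_b']
b2 = mod2.get_params()[0]['fc2_b']
# If the two modules truly shared storage, this would print True;
# two independently bound modules will print False.
print(b1 is b2)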
@zihaolucky
They don't seem to be the same. Yes, I use two mods, one for each input and output.
I think creating two mods and then binding their inputs/outputs separately creates two different networks, even if they are built from the "same" symbols. Is there a way to correct this? (Maybe I am doing something wrong.)
If not, is it possible to train only one input and one output of a network which has two inputs/outputs? I did not find a topic about it.
I am having the same issue - I would really like to be able to share parameters across two modules, because each module will be reading from a distinct data stream and training a distinct network with only some shared components. A somewhat minimal example appears below:
import logging
import mxnet as mx
mnist = mx.test_utils.get_mnist()
batch_size = 100
train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
data = mx.sym.var('data')
data = mx.sym.flatten(data=data)
# Set up variables to share parameters for all layers
w1 = mx.sym.Variable('1_weight')
w2 = mx.sym.Variable('2_weight')
w3 = mx.sym.Variable('3_weight')
b1 = mx.sym.Variable('1_bias')
b2 = mx.sym.Variable('2_bias')
b3 = mx.sym.Variable('3_bias')
# Build network 1 with explicit weight pointers
fc1 = mx.sym.FullyConnected(data=data, num_hidden=128, weight=w1, bias=b1)
act1 = mx.sym.Activation(data=fc1, act_type="relu")
fc2 = mx.sym.FullyConnected(data=act1, num_hidden=64, weight=w2, bias=b2)
act2 = mx.sym.Activation(data=fc2, act_type="relu")
fc3 = mx.sym.FullyConnected(data=act2, num_hidden=10, weight=w3, bias=b3)
mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax')
# Build module 1
mlp_model = mx.mod.Module(symbol=mlp, context=mx.cpu())
# Now build a second, identical graph that shares the same weight pointers
fc1s = mx.sym.FullyConnected(data=data, num_hidden=128, weight=w1, bias=b1)
act1s = mx.sym.Activation(data=fc1s, act_type="relu")
fc2s = mx.sym.FullyConnected(data=act1s, num_hidden=64, weight=w2, bias=b2)
act2s = mx.sym.Activation(data=fc2s, act_type="relu")
fc3s = mx.sym.FullyConnected(data=act2s, num_hidden=10, weight=w3, bias=b3)
mlps = mx.sym.SoftmaxOutput(data=fc3s, name='softmax')
# Build module 2
mlp_models = mx.mod.Module(symbol=mlps, context=mx.cpu())
# Train module 1
logging.getLogger().setLevel(logging.DEBUG) # logging to stdout
print("\n===Training module1===\n")
mlp_model.fit(train_iter,                      # train data
              eval_data=val_iter,              # validation data
              optimizer='sgd',                 # use SGD to train
              optimizer_params={'learning_rate': 0.1},  # use fixed learning rate
              eval_metric='acc',               # report accuracy during training
              batch_end_callback=mx.callback.Speedometer(batch_size, 100),
              num_epoch=5)                     # train for at most 5 dataset passes
# Train module 2
# We expect the shared module to start where the first module finished
print("\n===Training module2===\n")
mlp_models.shared_group = mlp_model._exec_group
mlp_models.fit(train_iter,                     # train data
               eval_data=val_iter,             # validation data
               optimizer='sgd',                # use SGD to train
               optimizer_params={'learning_rate': 0.1},  # use fixed learning rate
               eval_metric='acc',              # report accuracy during training
               batch_end_callback=mx.callback.Speedometer(batch_size, 100),
               num_epoch=5)                    # train for at most 5 dataset passes
# Making sure that fit doesn't always overwrite parameters by returning to module 1
print("\n===Training module1===\n")
mlp_model.fit(train_iter,                      # train data
              eval_data=val_iter,              # validation data
              optimizer='sgd',                 # use SGD to train
              optimizer_params={'learning_rate': 0.1},  # use fixed learning rate
              eval_metric='acc',               # report accuracy during training
              batch_end_callback=mx.callback.Speedometer(batch_size, 100),
              num_epoch=5)                     # train for at most 5 dataset passes
@cherryc Great, maybe you can write a tutorial or example to help more people with this feature.
No, sorry - I wasn't clear - the above code does not work as intended! Where I say, "We expect the shared module to start where the first module finished," instead the shared module starts over from random initialization, indicating that the two modules do not share parameters. I haven't yet been able to find a way to make them.
Does it work if we recreate the module from the symbol before each fit?
mlp_model = mx.mod.Module(symbol=mlp, context=mx.cpu())
mlp_model.fit(...)
mlp_models = mx.mod.Module(symbol=mlps, context=mx.cpu())
mlp_models.fit(...)
mlp_model = mx.mod.Module(symbol=mlp, context=mx.cpu())
mlp_model.fit(...)
@piiswrong how can we achieve this? If we give a param the same name in a different module, does it use the same NDArray?
I tested using get_params() to transfer weights between the models, and it seems to work.
But I'm not sure the weight transfer will work correctly if the two networks have different layers after the shared layers (the fit function will overwrite the other layers with their default initialization).
This method doesn't use the shared-weight functionality of MXNet.
EDIT1: if both networks have independent layers after the shared layers, it seems to work too.
# Train module 1
mlp_model = mx.mod.Module(symbol=mlp, context=mx.cpu())
logging.getLogger().setLevel(logging.DEBUG) # logging to stdout
print("\n===Training module1===\n")
mlp_model.fit(train_iter,                      # train data
              eval_data=val_iter,              # validation data
              optimizer='sgd',                 # use SGD to train
              optimizer_params={'learning_rate': 0.1},  # use fixed learning rate
              eval_metric='acc',               # report accuracy during training
              batch_end_callback=mx.callback.Speedometer(batch_size, 100),
              num_epoch=1)                     # train for 1 dataset pass
# Train module 2
# We expect the shared module to start where the first module finished
print("\n===Training module2===\n")
arg_param, aux_param = mlp_model.get_params()
mlp_models = mx.mod.Module(symbol=mlps, context=mx.cpu())
#mlp_models.shared_group = mlp_model._exec_group
mlp_models.fit(train_iter,                     # train data
               eval_data=val_iter,             # validation data
               optimizer='sgd',                # use SGD to train
               optimizer_params={'learning_rate': 0.1},  # use fixed learning rate
               eval_metric='acc',              # report accuracy during training
               batch_end_callback=mx.callback.Speedometer(batch_size, 100),
               num_epoch=1,                    # train for 1 dataset pass
               arg_params=arg_param)           # start from module 1's trained params
arg_param, aux_param = mlp_models.get_params()
# Making sure that fit doesn't always overwrite parameters by returning to module 1
print("\n===Training module1===\n")
mlp_model.fit(train_iter,                      # train data
              eval_data=val_iter,              # validation data
              optimizer='sgd',                 # use SGD to train
              optimizer_params={'learning_rate': 0.1},  # use fixed learning rate
              eval_metric='acc',               # report accuracy during training
              batch_end_callback=mx.callback.Speedometer(batch_size, 100),
              num_epoch=1,                     # train for 1 dataset pass
              arg_params=arg_param)            # continue from module 2's trained params
For param sharing I think you just need to use the shared_module option during mod.bind() when you have 2 modules sharing the same parameters.
One detail is that you need a "master module" which has all the symbols that you need, and pass it as the shared_module to mod1.bind() and mod2.bind(), where mod1 and mod2 have a subset of the params/symbols.
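For illustration, a rough, untested sketch of that setup, reusing get_two_network(), the data iterators, and the batch_size/x/y placeholders from earlier in this thread (the exact shapes and label names are assumptions):

import mxnet as mx

softmax1, softmax2 = get_two_network()

# "Master" module that owns every parameter: group both outputs into one symbol.
master_sym = mx.sym.Group([softmax1, softmax2])
master_mod = mx.mod.Module(master_sym, context=mx.cpu(),
                           data_names=['data1', 'data2'],
                           label_names=['softmax1_label', 'softmax2_label'])
master_mod.bind(data_shapes=[('data1', (batch_size, 3, x, y)),
                             ('data2', (batch_size, 3, x, y))],
                label_shapes=[('softmax1_label', (batch_size,)),
                              ('softmax2_label', (batch_size,))])
master_mod.init_params()
master_mod.init_optimizer(optimizer='sgd',
                          optimizer_params=(('learning_rate', 0.001),))

# Each sub-module uses a subset of the symbols/params and is bound with
# shared_module=master_mod, so it reuses the master's parameter arrays
# (and, since the master's optimizer is initialized, its optimizer state).
mod1 = mx.mod.Module(softmax1, context=mx.cpu(),
                     data_names=['data1'], label_names=['softmax1_label'])
mod1.bind(data_shapes=[('data1', (batch_size, 3, x, y))],
          label_shapes=[('softmax1_label', (batch_size,))],
          shared_module=master_mod)

mod2 = mx.mod.Module(softmax2, context=mx.cpu(),
                     data_names=['data2'], label_names=['softmax2_label'])
mod2.bind(data_shapes=[('data2', (batch_size, 3, x, y))],
          label_shapes=[('softmax2_label', (batch_size,))],
          shared_module=master_mod)

# Alternate batches from the two iterators.
batch1 = train_dataiter1.next()
mod1.forward(batch1, is_train=True)
mod1.backward()
mod1.update()

batch2 = train_dataiter2.next()
mod2.forward(batch2, is_train=True)
mod2.backward()
mod2.update()

Because both sub-modules are bound against the master's executor group, an update from either module should modify the same underlying parameter NDArrays, so the shared fc1/fc2 layers get trained on both datasets while fc3_1 and fc3_2 are only touched by their own module.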
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Also, do please check out our forum (and Chinese version) for general "how-to" questions.