Keras: Possible bug with Sequence+fit_generator on multi_gpu_model (reproducible)

Created on 5 Dec 2018 · 9 comments · Source: keras-team/keras

I have Keras 2.2.4, TensorFlow 1.12, and CUDA 9.2, all up to date.
I'm training on four Tesla K80 GPUs; the problem is the same with two or three GPUs. However, if I leave it as a standard Keras model, it trains as expected on a single GPU.

I was unable to fix this problem in a more complicated network, and it turned out to be trivial to reproduce. The copy-paste-able code is below, and a picture of the error below that. If you change the batch size, the mismatched shapes will still be [BATCH_SIZE*64] and [BATCH_SIZE, 64], so something the Sequence pulls in is being flattened behind the scenes when it shouldn't be, but only when using multiple GPUs.
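For context on why the batch dimension matters here: multi_gpu_model implements data parallelism by slicing each input along axis 0 into one sub-batch per GPU, running a model replica on each slice, and concatenating the outputs on the CPU. Roughly like the sketch below (an illustration only, not the actual Keras source):

# Illustration of the per-GPU input slicing multi_gpu_model performs
# (the real implementation lives in keras/utils/multi_gpu_utils.py and
# also handles batches that don't divide evenly across GPUs)
import tensorflow as tf

def get_shard(x, gpu_index, num_gpus):
    batch = tf.shape(x)[0]
    shard = batch // num_gpus
    return x[gpu_index * shard:(gpu_index + 1) * shard]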

import tensorflow as tf
import numpy as np
from keras.utils import multi_gpu_model, Sequence
from keras.layers import Input, LSTM, Dense, TimeDistributed, Embedding
from keras.models import Model

BATCH_SIZE = 4

class trivial_Sequence(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x = np.zeros((batch_size, 64))
        self.y = np.zeros((batch_size, 64, 1))
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x)/float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size:(idx+1)*self.batch_size]
        batch_y = self.y[idx*self.batch_size:(idx+1)*self.batch_size]

        return batch_x, batch_y


def error_train():
    #instantiate components
    td = trivial_Sequence(None, None, BATCH_SIZE)
    input = Input(shape=(None,), dtype='int32')
    emb = Embedding(output_dim=10, input_dim = 64, input_length=None)
    encode = LSTM(10, return_sequences=True, return_state = True)
    project_up = Dense(units=64, activation='softmax')

    #build network
    temp = emb(input)
    temp, _, _ = encode(temp)
    output = TimeDistributed(project_up)(temp)

    #as per the Keras documentation, even though there's no way this will OOM
    with tf.device('/cpu:0'):
        model = Model(inputs = input, outputs = output)

    parallel_model = multi_gpu_model(model, gpus=4)

    parallel_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                       metrics=['sparse_categorical_accuracy'])

    parallel_model.fit_generator(td, epochs=1, verbose=1)

#run it
error_train()

[Screenshot of the error traceback: incompatible shapes [BATCH_SIZE*64] vs [BATCH_SIZE, 64]]

Labels: tensorflow, awaiting response, bug, performance

All 9 comments

Possibly linked to #11495

@paulsens Did you get a chance to go through #11495?

Yeah, I looked through that one before posting; none of the solutions seem applicable. I'm going to try removing a few more layers tomorrow to narrow down the issue some more.

So it's nothing to do with Sequence or the functional API like I thought. The following gives the same flattening problem:

#same imports as the first snippet, plus Sequential
from keras.models import Sequential

BATCH_SIZE = 6

def error_train():
    x = np.zeros((BATCH_SIZE, 64, 1))
    y = np.zeros((BATCH_SIZE, 64, 1))

    model = Sequential()
    encode = LSTM(10, return_sequences=True, input_shape = (64,1))
    project_up = Dense(units=50, activation='softmax')

    model.add(encode)
    model.add(TimeDistributed(project_up))
    parallel_model = multi_gpu_model(model, gpus=2)

    parallel_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                           metrics=['sparse_categorical_accuracy'])

    parallel_model.fit(x, y, epochs = 1, batch_size = BATCH_SIZE, verbose = 1)

error_train()

The problem appears to come from using return_sequences together with a TimeDistributed layer, as the following code works fine:

BATCH_SIZE = 6
def error_train():
    x = np.zeros((BATCH_SIZE, 64, 1))
    y = np.zeros((BATCH_SIZE, 1))

    model = Sequential()
    encode = LSTM(10, return_sequences=False, input_shape = (64,1))
    project_up = Dense(units=50, activation='softmax')

    model.add(encode)
    model.add(project_up)
    parallel_model = multi_gpu_model(model, gpus=2)

    parallel_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                        metrics=['sparse_categorical_accuracy'])

    parallel_model.fit(x, y, epochs = 1, batch_size = BATCH_SIZE, verbose =1 )

error_train()

I am very very open to a hack solution or workaround or anything for the time being.

This is not a successful workaround as I initially posted, but it does narrow things down further. Apparently multi_gpu_model isn't the cause, because I get the same flattening error when I split up the GPUs the old-school way, as seen in the code further below. This is very strange because, as I mentioned in the original post, the model works fine on a single GPU or on the CPU alone. Is there some way I can reshape the output tensor INCLUDING the batch dimension? It seems like Flatten and Reshape can't touch the batch dimension, and that's the only hack solution I can think of.
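For what it's worth, a Lambda layer wrapping a backend reshape can change the batch dimension, unlike Reshape or Flatten. A minimal sketch (shapes chosen to match this example; untested against the multi-GPU mismatch):

from keras import backend as K
from keras.layers import Lambda

# Unlike keras.layers.Reshape, a Lambda around K.reshape may also
# reshape across the batch axis (illustrative shapes in the comments)
flatten_with_batch = Lambda(lambda t: K.reshape(t, (-1,)))        # (batch, 64) -> (batch*64,)
restore_with_batch = Lambda(lambda t: K.reshape(t, (-1, 64, 1)))  # (batch*64,) -> (batch, 64, 1)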

import tensorflow as tf
import numpy as np
from keras.utils import multi_gpu_model, Sequence
from keras.layers import Input, LSTM, Dense, TimeDistributed, Embedding, Lambda, Concatenate
from keras.models import Model

BATCH_SIZE = 4
num_gpus = 2

#credit to @marc-moreaux on github for his crop function, found in #890 
def crop(dimension, start, end):
    # Crops (or slices) a Tensor on a given dimension from start to end
    # example : to crop tensor x[:, :, 5:10]
    # call slice(2, 5, 10) as you want to crop on the second dimension
    def func(x):
        if dimension == 0:
            return x[start: end]
        if dimension == 1:
            return x[:, start: end]
        if dimension == 2:
            return x[:, :, start: end]
        if dimension == 3:
            return x[:, :, :, start: end]
        if dimension == 4:
            return x[:, :, :, :, start: end]
    return Lambda(func)

class trivial_Sequence(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x = np.zeros((batch_size, 64))
        self.y = np.zeros((batch_size, 64, 1))
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x)/float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size:(idx+1)*self.batch_size]
        batch_y = self.y[idx*self.batch_size:(idx+1)*self.batch_size]

        return batch_x, batch_y


def error_train():
    #instantiate components
    td = trivial_Sequence(None, None, BATCH_SIZE)
    input = Input(shape=(None,), dtype='int32')
    emb = Embedding(output_dim=10, input_dim = 64, input_length=None)
    encode = LSTM(10, return_sequences=True, return_state = True)
    project_up = Dense(units=64, activation='softmax')

    #build network
    input_0 = crop(0, 0, int(BATCH_SIZE/num_gpus))(input)
    input_1 = crop(0, int(BATCH_SIZE/num_gpus), BATCH_SIZE)(input)
    #generalize the factor for more gpus, easy

    with tf.device('/gpu:0'):
        temp0 = emb(input_0)
        temp0, _, _ = encode(temp0)
        output0 = TimeDistributed(project_up)(temp0)

    with tf.device('/gpu:1'):
        temp1 = emb(input_1)
        temp1, _, _ = encode(temp1)
        output1 = TimeDistributed(project_up)(temp1)

    with tf.device('/cpu:0'):
        output = Concatenate(axis=0)([output0, output1])

    with tf.device('/cpu:0'):
        model = Model(inputs = input, outputs = output)


    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                   metrics=['sparse_categorical_accuracy'])

    model.fit_generator(td, epochs=1, verbose=1)

#run it
error_train()

UPDATE: as far as I can tell, downgrading to Keras 2.2.0 gets around this flattening error, but since it's an older version there are some other lingering bugs that I'm trying to sort through now...
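If anyone wants to try the same downgrade, pinning the older release in a pip environment is just pip install keras==2.2.0 (adjust for conda or other setups).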

This should be my last update on this issue, as I have solved all the problems in the architecture I'm using. Downgrading to Keras 2.2.0 allows all the code previously posted in this issue to run, that is, code using return_sequences + TimeDistributed with multi_gpu_model, but a more complex architecture may still hit an issue. The one I ran into can be seen in the following basic encoder-decoder, where multi_gpu_model does not split up an AUXILIARY input properly, and you get a shape mismatch of [batchsize/num_gpus, x] vs [batchsize, x] (the workaround code is further down):

import tensorflow as tf
import numpy as np
from keras.utils import multi_gpu_model, Sequence
from keras.layers import Input, LSTM, Dense, TimeDistributed, Embedding
from keras.models import Model

BATCH_SIZE = 8
num_gpus = 2

class trivial_Sequence(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x = np.zeros((batch_size, 64))
        self.w = np.zeros((batch_size,64))
        self.y = np.zeros((batch_size, 64, 1))
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x)/float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size:(idx+1)*self.batch_size]
        batch_w = self.w[idx*self.batch_size:(idx+1)*self.batch_size]
        batch_y = self.y[idx*self.batch_size:(idx+1)*self.batch_size]

        return [batch_x, batch_w], batch_y

def error_train():
    #instantiate components
    td = trivial_Sequence(None, None, BATCH_SIZE)
    input = Input(shape=(None,), dtype='int32')
    aux_input = Input(shape=(None,), dtype='int32')
    emb = Embedding(output_dim=10, input_dim = 64, input_length=None)
    encode = LSTM(10, return_sequences=True, return_state = False)
    encode2 = LSTM(10, return_sequences=False, return_state = True)
    decode = LSTM(10, return_sequences=True)
    decode2 = LSTM(10,return_sequences=True, return_state = True)


    project_up = Dense(units=64, activation='softmax')

    #build network
    temp = emb(input)
    temp = encode(temp)
    temp, state_h, state_c = encode2(temp)
    encoder_states = [state_h, state_c]

    temp = emb(aux_input)
    temp = decode(temp, initial_state = encoder_states)
    output, _, _ = decode2(temp)

    output = TimeDistributed(project_up)(output)

    with tf.device('/cpu:0'):
        model = Model(inputs = [input,aux_input], outputs = output)

    parallel_model = multi_gpu_model(model, gpus=2)

    parallel_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                       metrics=['sparse_categorical_accuracy'])

    parallel_model.fit_generator(td, epochs=1, verbose=1)

#run it
error_train()

I do not know whether the current version of Keras has this same problem, because I never got past the flattening error on the current version. This is perhaps the same problem as #9178, but their solution is not applicable to most architectures. However, it can be worked around (again, on 2.2.0) by splitting up the input manually with a crop function and using the old-school implementation of multiple GPUs, as follows:

#see previous imports

BATCH_SIZE = 8
num_gpus = 2

#credit to @marc-moreaux on github for his crop function, found in #890
def crop(dimension, start, end):
    # Crops (or slices) a Tensor on a given dimension from start to end
    # example : to crop tensor x[:, :, 5:10]
    # call slice(2, 5, 10) as you want to crop on the second dimension
    def func(x):
        if dimension == 0:
            return x[start: end]
        if dimension == 1:
            return x[:, start: end]
        if dimension == 2:
            return x[:, :, start: end]
        if dimension == 3:
            return x[:, :, :, start: end]
        if dimension == 4:
            return x[:, :, :, :, start: end]
    return Lambda(func)


class trivial_Sequence(Sequence):
    #see above

def good_train():
    #instantiate components
    td = trivial_Sequence(None, None, BATCH_SIZE)
    input = Input(shape=(None,), dtype='int32')
    aux_input = Input(shape=(None,), dtype='int32')
    emb = Embedding(output_dim=10, input_dim = 64, input_length=None)
    encode = LSTM(10, return_sequences=True, return_state = False)
    encode2 = LSTM(10, return_sequences=False, return_state = True)
    decode = LSTM(10, return_sequences=True)
    decode2 = LSTM(10, return_sequences=True, return_state = True)


    project_up = Dense(units=64, activation='softmax')

    #build network
    input_0 = crop(0, 0, int(BATCH_SIZE / num_gpus))(input)
    input_1 = crop(0, int(BATCH_SIZE / num_gpus), BATCH_SIZE)(input)

    aux_input_0 = crop(0, 0, int(BATCH_SIZE / num_gpus))(aux_input)
    aux_input_1 = crop(0, int(BATCH_SIZE / num_gpus), BATCH_SIZE)(aux_input)

    with tf.device('/gpu:0'):
        temp = emb(input_0)
        temp = encode(temp)
        temp, state_h, state_c = encode2(temp)
        encoder_states = [state_h, state_c]

        temp = emb(aux_input_0)
        temp = decode(temp, initial_state=encoder_states)
        output0, _, _ = decode2(temp)

        output0 = TimeDistributed(project_up)(output0)

    with tf.device('/gpu:1'):
        temp = emb(input_1)
        temp = encode(temp)
        temp, state_h, state_c = encode2(temp)
        encoder_states = [state_h, state_c]

        temp = emb(aux_input_1)
        temp = decode(temp, initial_state=encoder_states)
        output1, _, _ = decode2(temp)

        output1 = TimeDistributed(project_up)(output1)

    with tf.device('/cpu:0'):
        output = Concatenate(axis=0)([output0, output1])

    with tf.device('/cpu:0'):
        model = Model(inputs = [input,aux_input], outputs = output)

    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                       metrics=['sparse_categorical_accuracy'])

    model.fit_generator(td, epochs=1, verbose=1)

#run it
good_train()

So indeed, the combination of downgrading to Keras 2.2.0 and using the long-form multiple-GPU implementation will let you train with return_sequences + TimeDistributed on multiple GPUs with auxiliary input(s). If you're NOT using auxiliary input(s), downgrading plus multi_gpu_model will probably be enough. I hope this helps someone in the future, and I'd be happy to answer questions if any of this doesn't make sense.
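As a side note, the hard-coded two-tower split above can be generalized to any number of GPUs with a loop. A rough sketch, assuming BATCH_SIZE divides evenly by num_gpus and a hypothetical build_tower(*slices) function that applies the shared layers to one slice of each input and returns that tower's output:

def manual_multi_gpu(input_tensors, build_tower, batch_size, num_gpus):
    # Slice each input along the batch axis, build one tower per GPU,
    # then concatenate the tower outputs back into one batch on the CPU
    shard = batch_size // num_gpus
    outputs = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            slices = [crop(0, i * shard, (i + 1) * shard)(t) for t in input_tensors]
            outputs.append(build_tower(*slices))
    with tf.device('/cpu:0'):
        return Concatenate(axis=0)(outputs)

#e.g. output = manual_multi_gpu([input, aux_input], build_tower, BATCH_SIZE, num_gpus)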

Any word on a fix for this? The cropping and concatenating in my workaround might be adding a lot of unnecessary overhead.

It is definitely adding overhead for me. I added an RTX 2080 to my rig but only get 20-30% more overall throughput after working around this.
