Keras: Why is Keras with the TensorFlow backend much slower than native TensorFlow?

Created on 5 Sep 2017 · 9 Comments · Source: keras-team/keras

Hi, I compared the training speed of Keras (with the Theano and TensorFlow backends) against native TensorFlow. I ran the same toy model in all three setups and found that native TensorFlow trains much faster than Keras.
The model is simply an embedding layer followed by two dense layers.
With TensorFlow as the Keras backend, I also tested both TFOptimizer and the Keras optimizer, to rule out the embedding layer's influence mentioned in #4365.
All experiments ran on a single NVIDIA K40 GPU.
Versions: keras 2.0.8, theano 0.9.0, tensorflow 1.2.0

Here is the result:
unit: samples/sec
keras-theano: 160
keras-tf-keras_opt: 246
keras-tf-tf_opt: 640
tensorflow: 1625

TensorFlow is about 2.5x faster than Keras with the TensorFlow backend and TFOptimizer.
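For reference, the ratios behind that figure can be recomputed from the numbers listed above. This is only a small arithmetic sketch; the dictionary and the `speedup` helper are illustrative names, not part of the original scripts.

```python
# Throughput figures reported in the post, in samples/sec.
throughput = {
    "keras-theano": 160,
    "keras-tf-keras_opt": 246,
    "keras-tf-tf_opt": 640,
    "tensorflow": 1625,
}

def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return throughput[candidate] / float(throughput[baseline])

# Compare native TensorFlow against each Keras configuration.
for name in ("keras-theano", "keras-tf-keras_opt", "keras-tf-tf_opt"):
    print("tensorflow vs %s: %.2fx" % (name, speedup(name, "tensorflow")))
```

The last line printed corresponds to the roughly 2.5x gap quoted for the TFOptimizer configuration.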

The scripts:
Keras:

from keras.models import Sequential, Model
import numpy as np
from keras.layers import Dense,Activation, Input
from keras.layers.embeddings import Embedding
#import tensorflow as tf
from keras.optimizers import TFOptimizer
import os

os.environ['CUDA_VISIBLE_DEVICES']='1'

#model=Sequential()
inputs=Input(shape=(50,))
embedding_vec=Embedding(700000,512)(inputs)
d1=Dense(256, activation='sigmoid')(embedding_vec)

d2=Dense(10000, activation='softmax')(d1)

model=Model(inputs=inputs,outputs=d2)
#model.compile(loss='sparse_categorical_crossentropy', optimizer=TFOptimizer(tf.train.GradientDescentOptimizer(0.1)))

model.compile(loss='sparse_categorical_crossentropy',optimizer='sgd')

# randint's upper bound is exclusive, so this matches random_integers(0, 9999)
x_train=np.random.randint(0,10000,(3200,50))
y_train=np.random.randint(0,10000,(3200,50,1))
model.summary()
# Keras 2 renamed nb_epoch to epochs
model.fit(x_train, y_train, epochs=20, batch_size=50)

tensorflow:

import tensorflow as tf
import numpy as np
from tqdm import tqdm
import os
os.environ['CUDA_VISIBLE_DEVICES']='3'
inputs=tf.placeholder(shape=(None, 50),dtype=tf.int32)
outputs=tf.placeholder(shape=(None,50),dtype=tf.int32)
#embedding_vec=EmbeddingLayer(70000,128)(inputs)
embedding = tf.get_variable(name = 'embedding', shape=(700000, 512))
embedding_vec = tf.gather(embedding, inputs)

#d1=Dense(128,256)(embedding_vec,scope='dense1')
W1 = tf.get_variable(name='W1',shape=(512,256),dtype=tf.float32)
b1 = tf.get_variable(name='b1',shape=(256,),dtype=tf.float32)
d1 = tf.matmul(tf.reshape(embedding_vec,shape=(-1,512)),W1) + b1
d1 = tf.reshape(d1,shape=(-1,50,256))
d1=tf.sigmoid(d1)

#d2=Dense(256,10000)(d1,scope='predict')
W2 = tf.get_variable(name='W2',shape=(256,10000),dtype=tf.float32)
b2 = tf.get_variable(name='b2',shape=(10000,),dtype=tf.float32)
d2 = tf.matmul(tf.reshape(d1,shape=(-1,256)),W2) + b2
d2 = tf.reshape(d2,shape=(-1,50,10000))


loss=tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=outputs,logits=d2))
opt=tf.train.GradientDescentOptimizer(0.1)
update=opt.minimize(loss)

config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
config.gpu_options.allow_growth=True

# randint's upper bound is exclusive, so this matches random_integers(0, 9999)
x=np.random.randint(0,10000,(3200,50))
y=np.random.randint(0,10000,(3200,50))
batch_size=50
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(20):
        print('epoch: %d' % epoch)
        # separate loop variable so the batch offset does not shadow the epoch counter
        for start in tqdm(range(0,(3200//batch_size)*batch_size,batch_size)):
            x_batch=x[start:start+batch_size]
            y_batch=y[start:start+batch_size]
            _,loss_val=sess.run((update,loss),feed_dict={inputs:x_batch,outputs:y_batch})
            #print loss_val

Many thanks!!

dongfangyixi

👍8

Most helpful comment

In general you can expect tf.nn.sparse_softmax_cross_entropy_with_logits to be way better optimized because it processes logits directly instead of just applying xent to a probability distribution. Good news: it's trivial to use in Keras, when you need it.
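To illustrate the difference between working on logits and working on probabilities, here is a minimal pure-Python sketch. It does not reproduce the Keras or TensorFlow implementations; the function names are hypothetical, and the clipping step is an assumption about how a probability-based loss typically guards against log(0). The epsilon of 1e-7 is borrowed from the keras.json posted in this thread.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def xent_from_probs(probs, label, eps=1e-7):
    """Cross-entropy on probabilities; clipping avoids log(0) (assumed approach)."""
    p = min(max(probs[label], eps), 1.0 - eps)
    return -math.log(p)

def xent_from_logits(logits, label):
    """Cross-entropy on logits via log-sum-exp, with no intermediate probabilities."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# For moderate logits both paths agree.
mild = [1.0, 2.0, 3.0]
print(xent_from_probs(softmax(mild), 2), xent_from_logits(mild, 2))

# For extreme logits the probability path saturates at -log(eps),
# while the logits path still returns the true loss.
extreme = [0.0, 1000.0, -1000.0]
print(xent_from_probs(softmax(extreme), 0), xent_from_logits(extreme, 0))
```

The logits version needs one pass and no clipping, which is one plausible reason a fused op such as tf.nn.sparse_softmax_cross_entropy_with_logits can be both faster and more accurate than a softmax layer followed by a separate cross-entropy.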

fchollet on 5 Sep 2017

👍6

All 9 comments

Okay, I just profiled your script. Keras seems to spend most of its time in tf.run, but about twice as much as the TF script does. Could you share your keras.json please? I'll investigate further soon.

Dref360 on 5 Sep 2017

Your model features a softmax over 10,000 classes, which is very expensive. Try an apples-to-apples comparison:

inputs = Input(shape=(50,))
embedding_vec = Embedding(700000, 512)(inputs)
d1 = Dense(256, activation='sigmoid')(embedding_vec)
d2 = Dense(10000)(d1)

model = Model(inputs=inputs, outputs=d2)

loss_fn = lambda y_true, y_pred: tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
opt = tf.train.GradientDescentOptimizer(0.1)
model.compile(loss=loss_fn, optimizer=opt)

fchollet on 5 Sep 2017

👍6

@Dref360 Hi, thanks a lot, here is my keras.json

{
    "image_dim_ordering": "tf", 
    "epsilon": 1e-07, 
    "floatx": "float32", 
    "backend": "tensorflow"
}

dongfangyixi on 6 Sep 2017

@fchollet Hi, thanks a lot, that's very useful.

I tried tf.nn.sparse_softmax_cross_entropy_with_logits, but it fails with a rank mismatch. I'm still working on that.

Besides, I tested the model with tf.nn.softmax_cross_entropy_with_logits, with the number of output classes reduced to 500.

Here is the result:
keras_tf_tf_opt: about 4000 samples/sec
tensorflow: about 6000 samples/sec
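The thread does not say how the samples/sec figures were measured. One hedged way to get comparable numbers from both scripts is to time a fixed number of training steps after a warm-up step, so that one-off graph construction and GPU initialization are excluded. The helper below is a hypothetical sketch; `step_fn` stands for whatever runs one batch (for example a `sess.run` call or `model.train_on_batch`).

```python
import time

def samples_per_sec(step_fn, n_batches, batch_size, warmup=1, clock=time.perf_counter):
    """Time `n_batches` calls of `step_fn` and return the throughput in samples/sec."""
    # Warm-up steps are run but not timed.
    for _ in range(warmup):
        step_fn()
    start = clock()
    for _ in range(n_batches):
        step_fn()
    elapsed = clock() - start
    return n_batches * batch_size / elapsed

# Usage with a stand-in workload; replace the lambda with a real training step.
rate = samples_per_sec(lambda: sum(range(10000)), n_batches=64, batch_size=50)
print("%.0f samples/sec" % rate)
```

The `clock` parameter exists only so the timer can be swapped out, for instance in tests.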

scripts:
keras

inputs = Input(shape=(50,))
embedding_vec = Embedding(700000, 512)(inputs)
d1 = Dense(256, activation='sigmoid')(embedding_vec)
d2 = Dense(500)(d1)

model = Model(inputs=inputs, outputs=d2)

loss_fn = lambda y_true, y_pred: tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
model.compile(loss=loss_fn, optimizer=TFOptimizer(tf.train.GradientDescentOptimizer(0.1)))

The loss function in the TensorFlow script was also changed to softmax_cross_entropy_with_logits.

Any idea why this happens?

Many thanks!

dongfangyixi on 6 Sep 2017

@dongfangyixi Did you ever figure out the rank mismatch problem?

lminer on 31 Oct 2017

Seems to work if I do this with the loss function. For some reason I have to cast after I squeeze.

def loss_fn(y_true, y_pred):
    y_true = tf.squeeze(y_true)
    y_true = tf.cast(y_true, tf.int32)
    return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
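A note on the shapes involved: in the original Keras script the labels have shape (3200, 50, 1) while the logits have shape (batch, 50, 10000), and the sparse loss expects labels with one fewer dimension than the logits. Squeezing without an axis removes every size-1 dimension, which would also drop the batch dimension when a batch holds a single sample. The NumPy sketch below shows the difference; `np.squeeze` is used here as a stand-in for `tf.squeeze`, which accepts a comparable `axis` argument, and `drop_label_axis` is a hypothetical helper name. The cast is probably needed because Keras feeds targets as floats, though that is an assumption about this setup.

```python
import numpy as np

def drop_label_axis(y_true):
    """Remove only the trailing size-1 axis and cast the labels to integers."""
    return np.squeeze(y_true, axis=-1).astype(np.int32)

# A batch containing a single sample of 50 timesteps.
y_true = np.zeros((1, 50, 1), dtype=np.float32)

print(np.squeeze(y_true).shape)       # every size-1 axis removed, batch axis lost
print(drop_label_axis(y_true).shape)  # only the trailing axis removed
```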

Need to do something similar with accuracy to get a reliable metric:

def accuracy_fn(y_true, y_pred):
    y_true = tf.squeeze(y_true)
    y_true = tf.cast(y_true, tf.int64)
    y_pred = tf.argmax(y_pred, 1)
    correct_predictions = tf.equal(y_pred, y_true)
    return tf.reduce_mean(tf.cast(correct_predictions, "float"))
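One caveat if this metric is applied to the model from the original post: its outputs are 3-D, (batch, 50, classes), so taking the argmax over axis 1 would reduce over the 50 positions rather than over the classes. The NumPy sketch below mirrors the metric with the argmax taken over the last axis instead. It is an illustration under that assumption, not the commenter's code, and `sparse_accuracy` is a hypothetical name.

```python
import numpy as np

def sparse_accuracy(y_true, logits):
    """Fraction of positions where the highest-scoring class equals the integer label."""
    labels = np.squeeze(y_true, axis=-1).astype(np.int64)
    predicted = np.argmax(logits, axis=-1)  # reduce over the class axis
    return float(np.mean(predicted == labels))

# Tiny example: 2 samples, 3 positions, 4 classes.
labels = np.array([[0, 1, 2], [3, 0, 1]])
logits = np.zeros((2, 3, 4))
for b in range(2):
    for t in range(3):
        logits[b, t, labels[b, t]] = 1.0
# Make one prediction wrong on purpose.
logits[1, 2] = [5.0, 0.0, 0.0, 0.0]

y_true = labels[..., np.newaxis].astype(np.float32)  # shape (2, 3, 1), as Keras would feed it
print(sparse_accuracy(y_true, logits))
```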

lminer on 31 Oct 2017

👍2

Is this still an issue?