Hi, I compared the training speed of Keras (Theano and TensorFlow backends) with native TensorFlow. I ran the same toy model in all three setups and found that native TensorFlow trains considerably faster than Keras.
The model is simply an embedding layer followed by two dense layers.
With TensorFlow as the Keras backend, I also compared TFOptimizer against the Keras optimizer to rule out the influence of the embedding layer, as mentioned in #4365.
All experiments ran on a single NVIDIA K40 GPU.
Versions: Keras 2.0.8, Theano 0.9.0, TensorFlow 1.2.0
Here is the result:
unit: samples/sec
keras-theano: 160
keras-tf-keras_opt: 246
keras-tf-tf_opt: 640
tensorflow: 1625
Native TensorFlow is about 2.5x faster than Keras with the TensorFlow backend and TFOptimizer.
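For reference, the ratios implied by these figures can be checked with a few lines of Python 3. The `measure_throughput` helper is hypothetical: the post does not say how the samples/sec numbers were obtained, so this is only one way such a figure could be measured, not the method used here.

```python
import time

def measure_throughput(step_fn, num_batches, batch_size):
    """Run step_fn num_batches times and return samples processed per second.

    step_fn is a hypothetical zero-argument callable that trains on one batch.
    """
    start = time.perf_counter()
    for _ in range(num_batches):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_batches * batch_size / elapsed

# Throughput figures reported above, in samples/sec.
reported = {
    "keras-theano": 160,
    "keras-tf-keras_opt": 246,
    "keras-tf-tf_opt": 640,
    "tensorflow": 1625,
}

# Speedup of native TensorFlow over each Keras configuration.
speedups = {name: reported["tensorflow"] / rate
            for name, rate in reported.items() if name != "tensorflow"}
for name, ratio in speedups.items():
    print("%s: %.1fx" % (name, ratio))
```

When timing a real training loop, it is usually worth discarding the first few batches, since they tend to include one-off setup costs.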
The scripts:
Keras:
from keras.models import Sequential, Model
import numpy as np
from keras.layers import Dense,Activation, Input
from keras.layers.embeddings import Embedding
#import tensorflow as tf
from keras.optimizers import TFOptimizer
import os
os.environ['CUDA_VISIBLE_DEVICES']='1'
#model=Sequential()
inputs=Input(shape=(50,))
embedding_vec=Embedding(700000,512)(inputs)
d1=Dense(256, activation='sigmoid')(embedding_vec)
d2=Dense(10000, activation='softmax')(d1)
model=Model(inputs=inputs,outputs=d2)
#model.compile(loss='sparse_categorical_crossentropy', optimizer=TFOptimizer(tf.train.GradientDescentOptimizer(0.1)))
model.compile(loss='sparse_categorical_crossentropy',optimizer='sgd')
x_train=np.random.random_integers(0,9999,(3200,50))
y_train=np.random.random_integers(0,9999,(3200,50,1))
print model.summary()
model.fit(x_train, y_train, nb_epoch=20, batch_size=50)
tensorflow:
import tensorflow as tf
import numpy as np
from tqdm import tqdm
import os
os.environ['CUDA_VISIBLE_DEVICES']='3'
inputs=tf.placeholder(shape=(None, 50),dtype=tf.int32)
outputs=tf.placeholder(shape=(None,50),dtype=tf.int32)
#embedding_vec=EmbeddingLayer(70000,128)(inputs)
embedding = tf.get_variable(name = 'embedding', shape=(700000, 512))
embedding_vec = tf.gather(embedding, inputs)
#d1=Dense(128,256)(embedding_vec,scope='dense1')
W1 = tf.get_variable(name='W1',shape=(512,256),dtype=tf.float32)
b1 = tf.get_variable(name='b1',shape=(256,),dtype=tf.float32)
d1 = tf.matmul(tf.reshape(embedding_vec,shape=(-1,512)),W1) + b1
d1 = tf.reshape(d1,shape=(-1,50,256))
d1=tf.sigmoid(d1)
#d2=Dense(256,10000)(d1,scope='predict')
W2 = tf.get_variable(name='W2',shape=(256,10000),dtype=tf.float32)
b2 = tf.get_variable(name='b2',shape=(10000,),dtype=tf.float32)
d2 = tf.matmul(tf.reshape(d1,shape=(-1,256)),W2) + b2
d2 = tf.reshape(d2,shape=(-1,50,10000))
loss=tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=outputs,logits=d2))
opt=tf.train.GradientDescentOptimizer(0.1)
update=opt.minimize(loss)
config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
config.gpu_options.allow_growth=True
x=np.random.random_integers(0,9999,(3200,50))
y=np.random.random_integers(0,9999,(3200,50))
batch_size=50
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in xrange(20):
        print 'epoch: ', epoch
        # Iterate over full batches only.
        for start in tqdm(range(0,(3200/batch_size)*batch_size,batch_size)):
            x_batch=x[start:start+batch_size]
            y_batch=y[start:start+batch_size]
            _,loss_val=sess.run((update,loss),feed_dict={inputs:x_batch,outputs:y_batch})
            #print loss_val
Many thanks!!
Okay, I just profiled your script. Keras spends most of its time in tf.run, roughly twice as long as the native TF script does. Could you share your keras.json, please? I'll investigate further soon.
Your model features a softmax over 10,000 classes, which is very expensive. Try an apples-to-apples comparison:
inputs = Input(shape=(50,))
embedding_vec = Embedding(700000, 512)(inputs)
d1 = Dense(256, activation='sigmoid')(embedding_vec)
d2 = Dense(10000)(d1)
model = Model(inputs=inputs, outputs=d2)
loss_fn = lambda y_true, y_pred: tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
opt = tf.train.GradientDescentOptimizer(0.1)
model.compile(loss=loss_fn, optimizer=opt)
In general you can expect tf.nn.sparse_softmax_cross_entropy_with_logits to be way better optimized because it processes logits directly instead of just applying xent to a probability distribution. Good news: it's trivial to use in Keras, when you need it.
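To illustrate the difference between working on logits and working on probabilities, here is a minimal NumPy sketch. It is not the TensorFlow or Keras implementation; the function names are made up for this example, and the extreme logits in the second row are chosen only to make the effect visible.

```python
import numpy as np

def softmax(logits):
    """Numerically shifted softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def xent_from_probs(probs, labels):
    """Cross-entropy applied to an already-normalised distribution."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked)

def xent_from_logits(logits, labels):
    """Cross-entropy computed directly from logits via log-sum-exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=-1))
    picked = shifted[np.arange(len(labels)), labels]
    return log_z - picked

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 100.0, -1000.0]])
labels = np.array([0, 2])

# The probability of the second row's label underflows to 0, so log() gives inf.
with np.errstate(divide="ignore"):
    via_probs = xent_from_probs(softmax(logits), labels)
via_logits = xent_from_logits(logits, labels)

print(via_probs)
print(via_logits)
```

Both routes agree on the first row. On the second row the probability route loses the information that the logits route still has, which is one reason a fused logits-based op is preferable, besides saving the extra softmax pass.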
@Dref360 Hi, thanks a lot, here is my keras.json
{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}
@fchollet Hi, thanks a lot, that's very useful.
I tried tf.nn.sparse_softmax_cross_entropy_with_logits, but it fails with a rank mismatch. I'm still working on that.
In the meantime, I tested the model with tf.nn.softmax_cross_entropy_with_logits and reduced the number of output classes to 500.
Here is the result:
keras_tf_tf_opt: about 4000 samples/sec
tensorflow: about 6000 samples/sec
scripts:
keras
inputs = Input(shape=(50,))
embedding_vec = Embedding(700000, 512)(inputs)
d1 = Dense(256, activation='sigmoid')(embedding_vec)
d2 = Dense(500)(d1)
model = Model(inputs=inputs, outputs=d2)
loss_fn = lambda y_true, y_pred: tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
model.compile(loss=loss_fn, optimizer=TFOptimizer(tf.train.GradientDescentOptimizer(0.1)))
The loss function in the TensorFlow script was also changed to softmax_cross_entropy_with_logits.
Any idea why this happens?
Many thanks!
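One thing to keep in mind with the dense variant: tf.nn.softmax_cross_entropy_with_logits expects labels with the same shape as the logits, so integer class IDs have to be one-hot encoded first. The post above does not show how the labels were prepared, so the following NumPy sketch is an assumption about that step, and `one_hot` is a hypothetical helper.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels of shape (...) into float32 one-hot vectors of shape (..., num_classes)."""
    return np.eye(num_classes, dtype="float32")[labels]

num_classes = 500
labels = np.random.randint(0, num_classes, size=(8, 50))  # small batch of class IDs
dense = one_hot(labels, num_classes)
print(dense.shape)

# Size of one-hot labels for the full 3200 x 50 training set as float32.
bytes_needed = 3200 * 50 * num_classes * 4
print(bytes_needed / 1e6, "MB")
```

With 500 classes the one-hot labels for the whole set take about 320 MB, which is part of why the sparse variant is attractive once the class count grows.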
@dongfangyixi Did you ever figure out the rank mismatch problem?
It seems to work if I do this in the loss function. For some reason I have to cast after I squeeze.
def loss_fn(y_true, y_pred):
    y_true = tf.squeeze(y_true)
    y_true = tf.cast(y_true, tf.int32)
    return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
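A likely explanation for both steps: the original Keras script builds y_train with shape (3200, 50, 1), whereas the sparse loss expects labels with one dimension fewer than the logits and an integer dtype. The NumPy sketch below only illustrates the shapes, using fewer classes than the thread's model. It also shows why naming the axis is safer than a bare squeeze; in TensorFlow the corresponding call would presumably be tf.squeeze(y_true, axis=-1).

```python
import numpy as np

batch, steps, classes = 4, 50, 10

# Shapes as in the scripts above, with fewer classes to keep the sketch small.
logits = np.random.randn(batch, steps, classes).astype("float32")
y_true = np.random.randint(0, classes, size=(batch, steps, 1)).astype("float32")

# Labels must have one dimension fewer than the logits and an integer dtype.
labels = np.squeeze(y_true, axis=-1).astype("int32")
print(labels.shape, logits.shape)

# A bare squeeze also drops a batch dimension of size 1; naming the axis does not.
single = y_true[:1]
print(np.squeeze(single).shape)           # (50,)
print(np.squeeze(single, axis=-1).shape)  # (1, 50)
```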
Need to do something similar with accuracy to get a reliable metric:
def accuracy_fn(y_true, y_pred):
    y_true = tf.squeeze(y_true)
    y_true = tf.cast(y_true, tf.int64)
    y_pred = tf.argmax(y_pred, 1)
    correct_predictions = tf.equal(y_pred, y_true)
    return tf.reduce_mean(tf.cast(correct_predictions, "float"))
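Note that an argmax over axis 1 picks the class only when the predictions are two-dimensional, (batch, classes). For sequence outputs shaped (batch, steps, classes), as in the model at the top of this thread, the class axis is the last one. The NumPy sketch below assumes that three-dimensional layout and uses toy data; `accuracy` is a hypothetical stand-in for the metric above, not a drop-in Keras metric.

```python
import numpy as np

def accuracy(y_true, logits):
    """Fraction of positions where the arg-max class equals the label."""
    labels = np.squeeze(y_true, axis=-1).astype("int64")
    predicted = np.argmax(logits, axis=-1)  # reduce over the class axis
    return float(np.mean(predicted == labels))

# Two sequences of three steps over four classes (toy data).
logits = np.zeros((2, 3, 4))
logits[0, :, 1] = 1.0   # first sequence always predicts class 1
logits[1, :, 2] = 1.0   # second sequence always predicts class 2
y_true = np.array([[[1], [1], [0]],
                   [[2], [2], [2]]], dtype="float32")

print(accuracy(y_true, logits))  # 5 of 6 positions match
```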
Is this still an issue?
Closing as this is resolved