I am still confused about the difference between Dense and TimeDistributedDense, even though there are already some similar questions asked here and here. People discuss it a lot, but there is no commonly agreed conclusion.
And even though, here, @fchollet stated that:
TimeDistributedDense applies the same Dense (fully-connected) operation to every timestep of a 3D tensor.
I still need a detailed illustration of exactly what the difference between them is.
The typical use case of TimeDistributedDense is for processing the output of an Embedding layer or a recurrent layer with return_sequences=True. Then you can transform the hidden representation at each timestep before applying further processing (like pooling or another recurrent layer).
I got an example from here. I am wondering what the difference is if I change the following TimeDistributedDense into Dense:
model = Sequential()
model.add(LSTM(hidden_neurons, input_dim=in_out_neurons, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributedDense(in_out_neurons))
model.add(Activation("linear"))
It won't compile, because the dimensions don't match up. Dense expects a 2-dimensional input (batch_size, features), whereas the output of an LSTM with return_sequences=True is 3-dimensional (batch_size, timesteps, features).
I did not mean purely changing the name from TimeDistributedDense to Dense. Can we change the layer from TimeDistributedDense to Dense while changing the dimensions as well? What would be the real difference in structure and in results?
That is:
A Dense layer deals with a 2D tensor and outputs a 2D tensor.
A TimeDistributedDense layer deals with a 3D tensor and outputs a 3D tensor.
The inner operation is the same, y = f(Wx + b), where f is the activation function and W and b are the weight and bias. In TimeDistributedDense, the operation is applied at every timestep.
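A small sketch of the shape difference (assuming a recent Keras version, where the old TimeDistributedDense layer is written as TimeDistributed(Dense(...))):
from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model

# 2D case: Dense maps (batch_size, features) -> (batch_size, units)
x2d = Input(shape=(30,))               # (None, 30)
y2d = Dense(20)(x2d)                   # (None, 20)

# 3D case: TimeDistributed(Dense) maps (batch_size, timesteps, features)
# -> (batch_size, timesteps, units), reusing the same W and b at every timestep
x3d = Input(shape=(10, 30))            # (None, 10, 30)
y3d = TimeDistributed(Dense(20))(x3d)  # (None, 10, 20)

print(Model(x2d, y2d).summary())
print(Model(x3d, y3d).summary())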
By every timestep, do you mean every unfolded unit of the recurrent layer? Or does each timestep actually mean each char in one sentence? Also, what are the pros and cons of TimeDistributedDense? Does it add a lot of computation time?
Yes, a timestep means every unfolded unit of the RNN.
For example, with sequence = [A, B, C, D, E]:
A is at time=1
B is at time=2
...
E is at time=5
If you want to apply y = f(Wx + b) at each timestep of your input (i.e. your input is a 3D tensor), TimeDistributedDense is your only choice, so there are no pros and cons to weigh.
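In plain numpy terms, "applied at every timestep" just means the same W and b are reused for each step of the sequence. A minimal sketch (the shapes and activation here are illustrative assumptions):
import numpy as np

timesteps, in_dim, out_dim = 5, 30, 20
x = np.random.randn(timesteps, in_dim)  # one sequence of 5 timesteps: [A, B, C, D, E]
W = np.random.randn(in_dim, out_dim)    # a single weight matrix, shared across all timesteps
b = np.random.randn(out_dim)

# apply y = f(Wx + b) independently at each timestep
y = np.stack([np.tanh(x[t] @ W + b) for t in range(timesteps)])
print(y.shape)  # (5, 20): one transformed vector per timestep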
I thought y = f(Wx + b) was applied at each timestep even when using Dense, since it is fully connected. So does Dense actually only apply the operation to the last timestep? Can I say that Dense is used in many-to-one or one-to-one cases, and TimeDistributedDense is used in many-to-many and one-to-many cases?
I think you should take a look at the Keras documentation carefully, and perhaps also the Theano documentation, because there is a big difference between Dense and TimeDistributedDense.
Dense only receives a 2D tensor, which means there is NO time dimension, i.e. a 2D -> 2D conversion.
TimeDistributedDense only receives a 3D tensor, which includes a time dimension, i.e. a 3D -> 3D conversion.
Q: So does Dense actually only apply the operation to the last timestep?
A: No, there is no time dimension in a Dense layer.
Q: Is Dense used in many-to-one or one-to-one cases?
A: It is one-to-one.
Q: And is TimeDistributedDense used in many-to-many and one-to-many cases?
A: It is many-to-many.
So, the lstm_text_generation example is actually a one-to-one case.
print('Build model...')
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
If you mean the Dense layer, that is the one-to-one case: the previous LSTM layer (with return_sequences=False) returns a 2D tensor, which is the final state of the LSTM, and the Dense layer outputs a 2D tensor, a probability distribution (softmax) over the whole vocabulary.
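To make the shapes concrete, here is a sketch of the same architecture with example values for maxlen and len(chars) (both values are illustrative assumptions):
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

maxlen, n_chars = 40, 57  # illustrative values only

model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, n_chars)))  # (None, 40, 512)
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))  # (None, 512): only the final state
model.add(Dropout(0.2))
model.add(Dense(n_chars))                     # (None, 57): one score per character
model.add(Activation('softmax'))              # (None, 57): probability distribution
print(model.summary())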
Thanks a lot, @ymcui. I am wondering whether you could take a look at another post of mine: Some interesting results of using this lstm_text_generation example. Need reasonable explanations. That would be very helpful.
Hi,
I want to train a simple neural network with data of shape (11, 501, 40).
I set the input_shape of the Dense layer to (11, 501, 40) as well, but it is not working.
Kindly guide me. The code and error are given below:
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
path="D:/DECASE2017/CNN/all_data_partial.npy"
data=np.load(open(path,'rb'))
X=np.array((data[:11,:501,:40]))# all channels,all rows and 40 columns
Y=np.array((data[:11,:501,40]))# all channels, all rows and only one column no. 40 (e.g class_label)
model = Sequential()
model.add(Dense(12,input_shape=(11,501,40),init='uniform', activation='relu'))
model.add(Dense(8, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
#compile a model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#fit the model
model.fit(X,Y,nb_epoch=1, batch_size=10)
#evaluate
score = model.evaluate(X,Y)
print("%s: %.2f%%" %(model.metrics_names[1], score[1]*100))
ValueError: Error when checking input: expected dense_80_input to have 4 dimensions, but got array with shape (11, 501, 40)
Thank you.
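For reference, the error arises because input_shape excludes the batch dimension: with input_shape=(11, 501, 40), Keras expects 4D input of shape (batch, 11, 501, 40). A minimal sketch of one possible fix, assuming the 11 samples are the batch, each of the 501 rows carries one binary label, and a Keras version where Dense acts on the last axis of a 3D input:
# reuses X of shape (11, 501, 40) and Y of shape (11, 501) defined above
model = Sequential()
model.add(Dense(12, input_shape=(501, 40), activation='relu'))  # (None, 501, 12)
model.add(Dense(8, activation='relu'))                          # (None, 501, 8)
model.add(Dense(1, activation='sigmoid'))                       # (None, 501, 1)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y[..., None], epochs=1, batch_size=10)             # Y reshaped to (11, 501, 1)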
@fluency03 In your model, why do you add an activation layer with a linear activation at the end, i.e. model.add(Activation("linear"))? Does it have any effect?
For a regression-type problem, what dimensions should I use to run my code?
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import tensorflow as tf
from matplotlib import pyplot
from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
seed = 7
np.random.seed(seed)
from scipy.io import loadmat
dataset = loadmat('matlab2.mat')
Bx=basantix[:, 50001:99999]
Bx=np.transpose(Bx)
Fx=fx[:, 50001:99999]
Fx=np.transpose(Fx)
from sklearn.cross_validation import train_test_split
Bx_train, Bx_test, Fx_train, Fx_test = train_test_split(Bx, Fx, test_size=0.2, random_state=0)
scaler = StandardScaler() # Class is create as Scaler
scaler.fit(Bx_train) # Then object is created or to fit the data into it
Bx_train = scaler.transform(Bx_train)
Bx_test = scaler.transform(Bx_test)
def base_model():
    model = Sequential()
    model.add(Dense(49999, input_shape=(20,), activation='relu'))
    model.add(Dense(20))
    model.add(Dense(49998, init='normal', activation='relu'))
    model.add(Dense(49998, init='normal'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
scale = StandardScaler()
Bx = scale.fit_transform(Bx)
Bx = scale.fit_transform(Bx)
clf = KerasRegressor(build_fn=base_model, nb_epoch=100, batch_size=5,verbose=0)
clf.fit(Bx,Fx)
res = clf.predict(Bx)
clf.score(Fx,res)
Kindly provide an exact solution.
@around1991 "It won't compile... Dense expects a 2-dimensional input..." This snippet compiles fine:
from keras.layers import TimeDistributed, Dense, Input, Conv1D, MaxPooling1D, Flatten
from keras.models import Model
inputs = Input(shape=(10, 30))
x = Dense(20)(inputs)
x = Conv1D(40, 5)(x)
x = MaxPooling1D(5)(x)
x = Flatten()(x)
x = Dense(3)(x)
model = Model(inputs, x)
print(model.summary())
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
and here is the model summary:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 10, 30) 0
_________________________________________________________________
dense_1 (Dense) (None, 10, 20) 620
_________________________________________________________________
conv1d_1 (Conv1D) (None, 6, 40) 4040
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 1, 40) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 40) 0
_________________________________________________________________
dense_2 (Dense) (None, 3) 123
=================================================================
Total params: 4,783
Trainable params: 4,783
Non-trainable params: 0
_________________________________________________________________
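For reference, the parameter counts follow directly from the shapes: the first Dense maps 30 -> 20 features per timestep, so 30×20 + 20 = 620; the Conv1D has 40 filters of width 5 over 20 input channels, so 5×20×40 + 40 = 4040; and the final Dense maps 40 -> 3, so 40×3 + 3 = 123. In particular, the Dense applied to the 3D input uses a single 30×20 weight matrix shared across all 10 timesteps.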
This version uses TimeDistributed and has the same model summary:
from keras.layers import TimeDistributed, Dense, Input, Conv1D, MaxPooling1D, Flatten
from keras.models import Model
inputs = Input(shape=(10, 30))
x = TimeDistributed(Dense(20))(inputs)
x = Conv1D(40, 5)(x)
x = MaxPooling1D(5)(x)
x = Flatten()(x)
x = Dense(3)(x)
model = Model(inputs, x)
print(model.summary())
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
Here is the summary:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 10, 30) 0
_________________________________________________________________
time_distributed_1 (TimeDist (None, 10, 20) 620
_________________________________________________________________
conv1d_1 (Conv1D) (None, 6, 40) 4040
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 1, 40) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 40) 0
_________________________________________________________________
dense_2 (Dense) (None, 3) 123
=================================================================
Total params: 4,783
Trainable params: 4,783
Non-trainable params: 0
_________________________________________________________________
@fluency03 Did you figure out the answer to your question? I am still confused by the above example! Dense does accept 3D input; it is simply a matrix multiplication (plus a bias term), and there is nothing wrong with (?, 10, 30) x (30, 20) ---> (?, 10, 20) (the matrix has 30x20 = 600 params). This matrix multiplication is nothing but applying a fully connected (30x20) layer to each of the 10 30-dimensional vectors of the input, which seems to be the same as what TimeDistributed does!
@rmanak I think that although the Dense layer accepts 3D input, it flattens the first two dimensions, whereas TimeDistributed(Dense) won't flatten the first two dimensions (batch_size, time_steps), so the temporal information is preserved and not mixed.
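One way to settle this is to check empirically. A quick sketch (assuming a recent Keras version) that copies the weights of a Dense layer into a TimeDistributed(Dense) and compares the outputs on the same 3D input:
import numpy as np
from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model

x = np.random.randn(4, 10, 30)  # (batch_size, timesteps, features)

inp = Input(shape=(10, 30))
m_dense = Model(inp, Dense(20)(inp))
m_td = Model(inp, TimeDistributed(Dense(20))(inp))
m_td.set_weights(m_dense.get_weights())  # use the same W and b in both models

out_dense = m_dense.predict(x)  # (4, 10, 20)
out_td = m_td.predict(x)        # (4, 10, 20)
# expected True in versions where Dense operates on the last axis of N-D input
print(np.allclose(out_dense, out_td, atol=1e-5))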