Keras: Why keras text classification examples are using Conv1D instead of Conv2D?

Created on 22 Aug 2016 · 7 comments · Source: keras-team/keras

I was looking at the Keras text classification examples that use ConvNets (e.g. imdb_cnn and pretrained_word_embeddings.py) and saw that 1D convolution is used. This does not make intuitive sense to me, because it seems we are not considering any local neighboring words this way. For example, if the word-embedding dimension is 100 and the convolution length is 5, it looks as if we are only convolving over the embedding dimensions (e.g. [1,...,5], [2,...,6], ...) and not over adjacent words. By contrast, Conv2D could also consider local patterns of neighboring words. So my question is: why is Conv1D used in the examples?

All 7 comments

Conv1D takes care of neighboring words. A filter length of 5 implies a context window of 5 words, i.e., the word embeddings of 5 whole words, not 5 elements within a single embedding. Images have height and width, so we use Conv2D; sentences are linear lists of words, so we use Conv1D. The "2D" or "3D" specifies how we slide the window through the input; it's not the rank of the convolution kernel itself.
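A minimal NumPy sketch (not from the thread; the function name and shapes are illustrative assumptions) of how a "valid" 1D convolution slides over whole word vectors, so that each filter covers a context window of words, not positions within one embedding:

```python
import numpy as np

def conv1d_valid(seq, kernels):
    """seq: (steps, channels); kernels: (n_filters, width, channels).
    Each filter spans `width` whole word vectors, i.e. a context
    window of `width` words, consuming every embedding dimension."""
    steps, channels = seq.shape
    n_filters, width, _ = kernels.shape
    out = np.empty((steps - width + 1, n_filters))
    for t in range(steps - width + 1):
        window = seq[t:t + width]  # (width, channels): `width` whole words
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

embeddings = np.random.randn(10, 100)  # 10 words, 100-dim embeddings
filters = np.random.randn(32, 5, 100)  # 32 filters, window of 5 words
print(conv1d_valid(embeddings, filters).shape)  # (6, 32): 10 - 5 + 1 steps
```

Note that the kernel is a full (width, channels) matrix: the "1D" refers only to the single direction in which the window moves.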

I see. Thanks for the clarification.

Is it the same in TensorFlow? I mean, if I have to convert Keras code to TensorFlow, does Conv1D stay Conv1D, or should it change to Conv2D?

In the imdb_cnn example, when a Conv1D layer is used after an embedding layer with output dimension (None, 400, 50), the resulting output dimension of the Conv1D layer is (None, 398, 250). It seems that the Conv1D layer is only applied along the length of each individual word vector, not along the number of words.

So how can you say that Conv1D takes care of neighbouring words? In that example, a filter length of 5 seems to operate only along the length of the word vector, not over a context window of 5 words.

I think you are wrong, @alihassanmirza: if the convolution operated only within single word vectors, the output length would still be 400. The output length 398 = 400 - 5 + 1 shows that the window of 5 slides over the 400 words.

The embedding layer's output is (batch_size, sequence_length, output_dim), while Conv1D's input is (batch, steps, channels). So in this case the dimension of the vector space generated by the embedding layer is the number of channels (like RGB in an image), while the context window moves along the other dimension (steps), which is the number of words in the sentence. A filter length of 5 therefore implies a context window of 5 words across every channel, as @farizrahman4u said. Am I right?

@Skullflow yes, you are obviously right.
