Keras: Why keras text classification examples are using Conv1D instead of Conv2D?

Created on 22 Aug 2016 · 7 comments · Source: keras-team/keras

I was looking at the Keras text classification examples that use ConvNets (e.g. imdb_cnn and pretrained_word_embeddings.py) and saw that 1D convolution is used. This does not make intuitive sense to me, because it seems we are not considering any local neighboring words this way. For example, if the word-embedding dimension is 100 and the convolution length is 5, it looks as if we are only convolving over the embedding dimensions (e.g. [1,...,5], [2,...,6], ...) and not over adjacent words. By contrast, Conv2D could also consider local patterns of neighboring words. So my question is: why is Conv1D used in the examples?

All 7 comments

Conv1D takes care of neighboring words. A filter length of 5 implies a context window of 5 words, i.e., the word embeddings of 5 whole words, not 5 elements within a single embedding. Images have height and width, so we use Conv2D; sentences are linear lists of words, so we use Conv1D. The "2D" or "3D" specifies how we slide the window through the input; it's not the rank of the convolution kernel itself.
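A minimal NumPy sketch (not from the thread; the function name and shapes are illustrative assumptions) of how a "valid" 1D convolution slides over whole word vectors, so that each filter covers a context window of words, not positions within one embedding:

```python
import numpy as np

def conv1d_valid(seq, kernels):
    """seq: (steps, channels); kernels: (n_filters, width, channels).
    Each filter spans `width` whole word vectors, i.e. a context
    window of `width` words, consuming every embedding dimension."""
    steps, channels = seq.shape
    n_filters, width, _ = kernels.shape
    out = np.empty((steps - width + 1, n_filters))
    for t in range(steps - width + 1):
        window = seq[t:t + width]  # (width, channels): `width` whole words
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

embeddings = np.random.randn(10, 100)  # 10 words, 100-dim embeddings
filters = np.random.randn(32, 5, 100)  # 32 filters, window of 5 words
print(conv1d_valid(embeddings, filters).shape)  # (6, 32): 10 - 5 + 1 steps
```

Note that the kernel is a full (width, channels) matrix: the "1D" refers only to the single direction in which the window moves.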

I see. Thanks for the clarification.

Is it the same in TensorFlow? I mean, if I have to convert Keras code to TensorFlow, does Conv1D stay Conv1D, or should it change to Conv2D?

In the imdb_cnn example, when a Conv1D layer is used after an embedding layer with output dimension (None, 400, 50), the resulting output dimension of the Conv1D layer is (None, 398, 250). It seems that the Conv1D layer is only applied along the length of each individual word vector, not along the number of words.

So how can you say that Conv1D takes care of neighbouring words? In that example, a filter length of 5 seems to operate only along the length of the word vector, not over a context window of 5 words.

I think you are wrong, @alihassanmirza: if the convolution operated only within single word vectors, the output length would still be 400. The output length 398 = 400 - 5 + 1 shows that the window of 5 slides over the 400 words.

The embedding layer's output is (batch_size, sequence_length, output_dim), while Conv1D's input is (batch, steps, channels). So in this case the dimension of the vector space generated by the embedding layer is the number of channels (like RGB in an image), while the context window moves along the other dimension (steps), which is the number of words in the sentence. A filter length of 5 therefore implies a context window of 5 words across every channel, as @farizrahman4u said. Am I right?

@Skullflow yes, you are obviously right.
