A snippet for tokenizing text:
# nb_words was renamed num_words, and base_filter() was removed in later
# Keras versions; the default filters already strip punctuation.
tk = keras.preprocessing.text.Tokenizer(num_words=500, lower=True, split=" ")
tk.fit_on_texts(x)
x = tk.texts_to_sequences(x)
What exactly is the difference between the above code and bag-of-words? In my opinion, they are the same.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.
The above approach constructs a list of sequences: a list of lists containing the IDs of the words in order of occurrence.

Bag-of-words is a matrix whose columns are indexed by word ID and whose rows are indexed by sequence ID; each cell holds one of: binary, tf-idf, count, or frequency values. You get this matrix with Tokenizer.texts_to_matrix(texts, mode='binary') or Tokenizer.sequences_to_matrix(sequences, mode='binary').
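To make the difference concrete, here is a minimal pure-Python sketch (deliberately not using Keras itself) of what the two representations look like for the same texts; the word-index construction mimics what Tokenizer.fit_on_texts does:

```python
# Two toy texts; assumed example data, not from the original thread.
texts = ["the cat sat", "the cat sat on the mat"]

# Build a word index like Tokenizer.fit_on_texts would (1-based IDs).
word_index = {}
for text in texts:
    for word in text.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

# texts_to_sequences: each text becomes a list of word IDs in occurring order.
sequences = [[word_index[w] for w in t.split()] for t in texts]

# texts_to_matrix(mode='binary'): one row per text, one column per word ID;
# a cell is 1 if the word occurs in that text, regardless of order or count.
num_words = len(word_index) + 1  # column 0 is unused, as in Keras
matrix = [[0] * num_words for _ in texts]
for row, seq in enumerate(sequences):
    for wid in seq:
        matrix[row][wid] = 1

print(sequences)  # [[1, 2, 3], [1, 2, 3, 4, 1, 5]] -- word order preserved
print(matrix)     # order lost; only presence per word is recorded
```

Note that in the sequence form the second text keeps both occurrences of "the" in place, while in the binary matrix both texts collapse to which words are present.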
What is the use of the embedding layer in Keras? I know it turns positive integer indices into dense, continuous vectors, but could anybody give a simple example of this?
The embedding layer is in fact a matrix. When you feed it a sequence, each index in the sequence is, mathematically speaking, converted into a one-hot-encoded vector and multiplied with the embedding matrix to select a row, which is then propagated through the network. In the backpropagation phase the error is propagated back down to the embedding layer, and its entries are adjusted as well. As the net learns, it tries to arrange these vectors in the embedding space so that the upper layers can classify them better.
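The lookup described above can be sketched in a few lines of NumPy (a toy illustration, not the Keras layer itself; the vocabulary size and dimensions are arbitrary):

```python
import numpy as np

# Toy sketch of what an Embedding layer does: a lookup into a weight matrix.
vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(vocab_size, embed_dim))  # the trainable weights

sequence = [1, 2, 3]            # word IDs, e.g. from a Tokenizer
vectors = embeddings[sequence]  # row lookup, one vector per word ID

# The same result via the explicit one-hot multiplication described above:
one_hot = np.eye(vocab_size)[sequence]
assert np.allclose(vectors, one_hot @ embeddings)

print(vectors.shape)  # (3, 3): one embed_dim vector per word in the sequence
```

During training, gradients flow into the selected rows of `embeddings`, so the rows for words that behave similarly gradually move closer together in the embedding space.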