Keras: difference between bag-of-words and tokenizer

Created on 5 Jan 2017 · 6 comments · Source: keras-team/keras

Snippet for tokenizing text (x is a list of strings):

import keras

# Keras 1.x API: in Keras 2, nb_words became num_words and base_filter() was removed.
tk = keras.preprocessing.text.Tokenizer(nb_words=500, filters=keras.preprocessing.text.base_filter(), lower=True, split=" ")
tk.fit_on_texts(x)
x = tk.texts_to_sequences(x)

What exactly is the difference between the above code and bag-of-words?

Label: stale

All 6 comments

In my opinion, they are the same.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.

The approach above constructs a list of sequences: a list of lists containing the IDs of the words in their order of occurrence.

Bag-of-words is a matrix whose columns are indexed by words and whose rows are indexed by sequence IDs; the content of the cells is one of binary, tfidf, count, or freq(uencies). You get this matrix with Tokenizer.texts_to_matrix(texts, mode='binary') or Tokenizer.sequences_to_matrix(sequences, mode='binary').
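A minimal sketch of the contrast, using the same Keras 1.x Tokenizer API as the snippet above (the texts and the printed outputs are illustrative):

from keras.preprocessing.text import Tokenizer

texts = ["the cat sat", "the cat sat on the mat"]
tk = Tokenizer(nb_words=10)
tk.fit_on_texts(texts)

# Sequences: one list of word IDs per text, word order preserved.
print(tk.texts_to_sequences(texts))   # e.g. [[1, 2, 3], [1, 2, 3, 4, 1, 5]]

# Bag-of-words: one row per text, one column per word ID, order discarded.
print(tk.texts_to_matrix(texts, mode='count'))   # e.g. row 2 counts "the" twice

The sequence form is what you feed to order-sensitive layers such as Embedding or LSTM; the matrix form is what you would feed directly to a plain Dense classifier.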

What is the use of the Embedding layer in Keras? I know it turns positive integer indices into continuous dense vectors, but could anybody give a simple example of this?

The embedding layer is in fact a matrix. When you feed it a sequence, each index in the sequence is, mathematically speaking, converted into a one-hot-encoded vector and multiplied with the embedding matrix, which selects the corresponding embedding vector; that vector is then propagated through the network. In the backpropagation phase the error is propagated down to the embedding layer, and its entries are adjusted as well. As the net learns, it tries to arrange these vectors in the embedding space so that the upper layers can classify them better.
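A minimal sketch, assuming the Keras Sequential API; the vocabulary size (1000), embedding dimension (64), and sequence length (10) are arbitrary illustration values:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
# Weight matrix of shape (1000, 64): one 64-dim vector per vocabulary index.
model.add(Embedding(input_dim=1000, output_dim=64, input_length=10))
model.compile('rmsprop', 'mse')

x = np.random.randint(1000, size=(32, 10))  # batch of 32 integer sequences
print(model.predict(x).shape)  # (32, 10, 64): each index replaced by its vector

During training, gradients flow back into the (1000, 64) weight matrix, so the per-index vectors move to positions in the embedding space that make the task easier for the layers above.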

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

