A snippet for tokenizing text:
# nb_words was renamed num_words, and base_filter() was removed in later
# Keras versions; the default filters already strip punctuation.
tk = keras.preprocessing.text.Tokenizer(num_words=500, lower=True, split=" ")
tk.fit_on_texts(x)
x = tk.texts_to_sequences(x)
What exactly is the difference between the above code and bag-of-words? In my opinion, they are the same.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.
The above approach constructs a list of sequences: a list of lists containing the IDs of the words in order of occurrence.

Bag-of-words is a matrix whose columns are indexed by word ID and whose rows are indexed by sequence ID; each cell holds one of: binary, tf-idf, count, or frequency values. You get this matrix with Tokenizer.texts_to_matrix(texts, mode='binary') or Tokenizer.sequences_to_matrix(sequences, mode='binary').
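To make the difference concrete, here is a minimal pure-Python sketch (deliberately not using Keras itself) of what the two representations look like for the same texts; the word-index construction mimics what Tokenizer.fit_on_texts does:

```python
# Two toy texts; assumed example data, not from the original thread.
texts = ["the cat sat", "the cat sat on the mat"]

# Build a word index like Tokenizer.fit_on_texts would (1-based IDs).
word_index = {}
for text in texts:
    for word in text.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

# texts_to_sequences: each text becomes a list of word IDs in occurring order.
sequences = [[word_index[w] for w in t.split()] for t in texts]

# texts_to_matrix(mode='binary'): one row per text, one column per word ID;
# a cell is 1 if the word occurs in that text, regardless of order or count.
num_words = len(word_index) + 1  # column 0 is unused, as in Keras
matrix = [[0] * num_words for _ in texts]
for row, seq in enumerate(sequences):
    for wid in seq:
        matrix[row][wid] = 1

print(sequences)  # [[1, 2, 3], [1, 2, 3, 4, 1, 5]] -- word order preserved
print(matrix)     # order lost; only presence per word is recorded
```

Note that in the sequence form the second text keeps both occurrences of "the" in place, while in the binary matrix both texts collapse to which words are present.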
What is the use of the embedding layer in Keras? I know it turns positive integer indices into dense, continuous vectors, but could anybody give a simple example of this?
The embedding layer is in fact a matrix. When you feed it a sequence, each index in the sequence is, mathematically speaking, converted into a one-hot-encoded vector and multiplied with the embedding matrix to select a row, which is then propagated through the network. In the backpropagation phase the error is propagated back down to the embedding layer, and its entries are adjusted as well. As the net learns, it tries to arrange these vectors in the embedding space so that the upper layers can classify them better.
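The lookup described above can be sketched in a few lines of NumPy (a toy illustration, not the Keras layer itself; the vocabulary size and dimensions are arbitrary):

```python
import numpy as np

# Toy sketch of what an Embedding layer does: a lookup into a weight matrix.
vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(vocab_size, embed_dim))  # the trainable weights

sequence = [1, 2, 3]            # word IDs, e.g. from a Tokenizer
vectors = embeddings[sequence]  # row lookup, one vector per word ID

# The same result via the explicit one-hot multiplication described above:
one_hot = np.eye(vocab_size)[sequence]
assert np.allclose(vectors, one_hot @ embeddings)

print(vectors.shape)  # (3, 3): one embed_dim vector per word in the sequence
```

During training, gradients flow into the selected rows of `embeddings`, so the rows for words that behave similarly gradually move closer together in the embedding space.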