Gensim: Issue in class HashDictionary

Created on 15 May 2018 · 7 comments · Source: RaRe-Technologies/gensim

My goal is to have a dictionary and a bag of words model by streaming over the corpus stream only once (fully online learning). For this, I looked at the Hashing trick described in the tutorials.

In the HashDictionary.add_documents() method, there are calls to doc2bow(), which builds a bag-of-words representation of the current document and returns it.

But the value returned by doc2bow() is not stored in any variable inside add_documents(). So although a bag-of-words model is formed, there is no way to view or process it without making another pass over the corpus stream.

I am not sure if this is expected behavior, but by moving the functionality of add_documents() into an __iter__() method (plus a few other changes), I could get the bag-of-words stream directly from the HashDictionary object. The trade-off is that the dictionary isn't built until I iterate over the HashDictionary object.
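For reference, the workaround looks roughly like this. This is a minimal sketch, not gensim code: `BowStreamDictionary` is a hypothetical name, and `dictionary` can be any object exposing a `doc2bow()` method, such as `HashDictionary`.

```python
class BowStreamDictionary:
    """Hypothetical wrapper: iterating over it yields one bag-of-words
    vector per document, so the underlying corpus stream is only read
    when (and each time) the wrapper itself is iterated."""

    def __init__(self, dictionary, tokenized_corpus):
        self.dictionary = dictionary            # any object with doc2bow()
        self.tokenized_corpus = tokenized_corpus

    def __iter__(self):
        for document in self.tokenized_corpus:
            yield self.dictionary.doc2bow(document)
```

The drawback is exactly the one described above: nothing is hashed (and no debug mapping is collected) until you actually iterate.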

Version Information:

Darwin-17.2.0-x86_64-i386-64bit
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.1
SciPy 1.0.0
gensim 3.4.0
FAST_VERSION 0

All 7 comments

HashDictionary is fully initialized from the start; it needs zero passes to "train" itself. It doesn't need any training because it translates strings into integers using a static, fixed hash mapping.

So you don't need to call add_documents() at all. I think it's there only for debugging reasons (to collect a reverse integer=>string mapping if self.debug is set).
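The fixed mapping can be illustrated with a small stdlib-only sketch. This is an approximation, not gensim's actual code; gensim's documented default hash is zlib.adler32, and id_range=30000 is just an example value:

```python
import zlib

def hash_id(token, id_range=30000):
    # deterministic token -> integer id; no training data needed
    return zlib.adler32(token.encode('utf-8')) % id_range

def doc2bow(tokens, id_range=30000):
    # count occurrences of each hashed id within one document
    counts = {}
    for token in tokens:
        tid = hash_id(token, id_range)
        counts[tid] = counts.get(tid, 0) + 1
    return sorted(counts.items())
```

Because the id is a pure function of the token, the same document always yields the same BoW vector with no add_documents() pass; the price is possible hash collisions (distinct tokens sharing one id).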

@piskvorky I understand that the HashDictionary is fully initialized from the start and it will not require any passes to train itself.
The control flow of HashDictionary when a new object is created is:
__init__() -> add_documents() -> doc2bow()

where add_documents() iterates over the input corpus. (Note: I don't call add_documents(), the __init__() method does if a corpus is provided during object initialization)

Assume that:

tokenized_corpus = CorpusIterator("dir_path")  # generator streaming one document at a time, as a list of tokens

Is there any way such that

dct = HashDictionary(tokenized_corpus)

and

bow = dct.doc2bow(tokenized_corpus)  # generator which streams the BoW equivalent of each document streamed by the corpus iterator

_With the present implementation, the tokenized_corpus stream is exhausted upon initialization of the HashDictionary object because of the control flow mentioned above._

In the documentation, all examples initialize a dictionary like:

texts = [['human', 'interface', 'computer']]
dct = HashDictionary(texts)
dct.doc2bow(texts[0])  # uses the data in texts again, which isn't possible with a generator

How can a BoW stream be obtained with a single stream over the data?

Don't pass the corpus to __init__, so that add_documents() is not called. Also, set debug=False unless you need the reverse id=>word mapping (this will save you a lot of memory):

dct = HashDictionary(id_range=30000, debug=False, documents=None)

@piskvorky Maybe I'm still not getting something, but how would you get a BoW stream?

tokenized_corpus = CorpusIterator("dir_path")  # iterable of iterables of str
dct = HashDictionary(id_range=30000, debug=False, documents=None)
dct.add_documents(tokenized_corpus)  # returns None

If the doc2bow() call inside add_documents() were

yield self.doc2bow(document, allow_update=True)  # add_documents() becomes a generator streaming BoW for each input document

we'd get the BoW stream directly. Is the omission of yield by design? Sorry for the trouble.

Not sure what stream you're talking about. To get a stream of BoW (bag-of-words) vectors, use the dictionary's doc2bow method:

bow_stream = (dct.doc2bow(document) for document in documents)
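One point worth noting about this pattern: the generator expression is lazy, so the documents stream is still traversed only once, and only when the BoW vectors are actually consumed. A stdlib-only illustration, where `CountingCorpus` and `fake_doc2bow` are made-up stand-ins rather than gensim code:

```python
from collections import Counter

class CountingCorpus:
    """Stand-in for a streamed corpus; records how many documents were read."""
    def __init__(self, docs):
        self._docs = docs
        self.docs_read = 0

    def __iter__(self):
        for doc in self._docs:
            self.docs_read += 1
            yield doc

def fake_doc2bow(doc):
    # stand-in for dct.doc2bow: per-document token counts
    return sorted(Counter(doc).items())

corpus = CountingCorpus([['human', 'interface'], ['graph', 'trees']])
bow_stream = (fake_doc2bow(doc) for doc in corpus)
assert corpus.docs_read == 0   # lazy: nothing has been read yet
bows = list(bow_stream)        # the single pass over the corpus happens here
assert corpus.docs_read == 2
```

The same holds with the real dct.doc2bow: wrap it in a generator expression and the corpus is read once, on demand.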

@JanmajaySingh do the documentation updates in #2073 make it clearer how to use HashDictionary in streaming mode?

Basically, the class needs no training, so you shouldn't try to train it. Look at the examples in the updated docs.

@piskvorky Yes, I submitted a review for the updates.

I was confused mainly because in both the Dictionary and HashDictionary classes, doc2bow() is called as soon as you create an object and pass the _documents_ parameter.

I thought that since a BoW was formed for each document, it was also being stored in some variable (a list or other collection) and returned somewhere. But it seems we HAVE to call ClassName.doc2bow() explicitly to access the BoW model. Maybe it was a design decision, because I could not think of a better way to do it.

Also, the examples in the tutorials (the website and the Jupyter notebooks) switch between using a generator for the corpus and concrete in-memory data (maybe for additional clarity). I found this a bit confusing at times. And I did not think of creating a bow_stream from:

(dct.doc2bow(document) for document in documents) 

The "Experiments of Wikipedia Corpus" tutorial is great, but to fully understand it, I had to read the make_wikicorpus.py and wikicorpus.py scripts, the latter of which is slightly difficult to understand.

But yeah, I also learnt a lot from your tutorials and on studying the source code, so thank you!

