Gensim: word2vec vocab building is not multi-threaded

Created on 15 Jul 2015 · 11 comments · Source: RaRe-Technologies/gensim

I've been running a couple billion words through word2vec, using a 32-core EC2 instance. The actual training of the model is blazing fast (relatively), but populating the vocab dict is painfully slow.

I forked and added multi-threading, but max_vocab_size pruning and progress logging were difficult to implement: you can't prune a single worker's vocab, since a different worker's sentences may contain more instances of the same word, and pruning only after all the vocab counts have been merged defeats the purpose entirely.

Because I had to disable those two things, I'm not going to submit a pull request. Here's what I changed, perhaps @piskvorky or someone with more time could get vocab size pruning working with multithreading?

https://github.com/btym/gensim/blob/develop/gensim/models/word2vec.py#L480

Labels: difficulty medium, performance


All 11 comments

Can you say a little more about your use case?

What is the performance now, what performance do you need, and why?

We can certainly optimize this part if there's a compelling reason to :)

Note: threading doesn't generally help much with Python, because of the global interpreter lock (GIL). And it looks like you're putting all sentences into plain lists, in RAM, which wouldn't scale (no streaming).
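
For illustration, a minimal sketch of the difference (the file name is just a placeholder): a list comprehension materialises every sentence in RAM, while a restartable iterable re-reads the file on each pass and never holds more than one line in memory.

import io

# Not scalable: the whole corpus is held in memory as a list of lists.
sentences_in_ram = [line.split() for line in io.open('corpus.txt', encoding='utf8')]

# Scalable: a restartable iterable that streams sentences from disk on every
# pass, so the vocab scan and the training epochs never hold the corpus in RAM.
class StreamedSentences(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with io.open(self.path, encoding='utf8') as f:
            for line in f:
                yield line.split()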

It's going through a fairly large history of GitHub events (the past 7 months, ~1.8 billion words). Memory is not an issue in my case, but with your implementation, building the vocab list is almost as slow as training the network, since it only runs on one core out of 32. (To be clear, threads will run on separate cores if core affinity is set correctly, but what I added could just as easily have been done with multiprocessing.)

I realize the way I'm doing it is specific to my use case (not streaming, no max vocab size), but there's no reason for vocab building not to be multi-threaded/multi-process, since it's such a huge performance loss. Streaming could be solved by sharing one iterator between threads and having the iterator handle its own locking. I'm not sure how max vocab size would work.
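
To make the shared-iterator idea concrete, here is a rough sketch (names are made up, and this only shows the sharing mechanism, not a full implementation): wrap the sentence stream in an iterator whose next() is guarded by a lock, and let each worker pull from it and tally into its own local counts, to be merged at the end. Note that with plain Python counting the GIL still serialises most of the work, so this mainly pays off if the per-word counting happens outside the GIL (e.g. in C/Cython) or in separate processes.

import threading

class LockedIterator(object):
    # Wraps any sentence iterator so several worker threads can pull from it safely.
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)

    next = __next__  # Python 2 compatibility

def vocab_worker(shared_sentences, local_counts):
    # Each worker tallies into its own dict; the partial dicts are merged afterwards.
    for sentence in shared_sentences:
        for word in sentence:
            local_counts[word] = local_counts.get(word, 0) + 1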

Thread affinity is not related to GIL. I'm surprised you saw an improvement with multithreading at all -- do you have any numbers?

Anyway, yes, parallelizing the vocab building sounds useful. If it clashes with the max vocab size pruning, I'd rather drop that and keep the parallel version. The parallelization optimization scenario sounds more commonly useful than pruning... although it brings more maintenance headaches (multiprocessing works differently on Windows). Do you have a way of testing on Windows?

If you find a way to build the vocab in parallel, while streaming the input (not everything in RAM), and being robust across platforms, can you please open a PR @btym?

For inspiration, see the actual training a few lines lower, which is also streamed and parallelized.

I may have been mistaken and read the wrong thing; I only tried the changes on a small set, since I was doing this while the other model was still training :) Should be able to switch to multiprocessing.

Can test on Windows/OSX/Linux conveniently. Will try to submit a PR if I have time this week/weekend.

Hey guys,

Been following this issue as this is also a big concern for me. Training on 2B+ word sets takes a whole lot of time, but most of it is actually spent scanning the vocabulary.

Based on @piskvorky's suggestion, I went ahead and added multiprocessing support to word2vec's scan_vocab, removing max-vocab pruning since it was much more complicated to deal with.

I haven't run any quantitative benchmarks yet, but I'm planning to do so tomorrow to compare scaling across many cores and with non-trivial LineSentence implementations.

For now the only hiccup is that my implementation uses collections.Counter to easily merge each worker's vocab, and that was only introduced in Python 2.7. Have a look at PR #406 and tell me what you think!
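
Roughly, the shape of it (a simplified sketch rather than the actual PR code; the chunk size and function names here are illustrative) is to count each chunk of sentences in a separate process and merge the per-chunk Counter objects at the end:

import itertools
import multiprocessing
from collections import Counter

def count_chunk(sentences):
    # Runs in a worker process: count the words of one chunk of sentences.
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts

def scan_vocab_parallel(sentences, processes=4, chunk_size=10000):
    # Slice the sentence stream into chunks, count each chunk in a worker
    # process, then merge the partial Counters into one vocabulary.
    stream = iter(sentences)
    chunks = iter(lambda: list(itertools.islice(stream, chunk_size)), [])
    vocab = Counter()
    pool = multiprocessing.Pool(processes)
    try:
        for partial in pool.imap_unordered(count_chunk, chunks):
            vocab.update(partial)
    finally:
        pool.close()
        pool.join()
    return vocab

if __name__ == '__main__':  # guard is required for multiprocessing on Windows
    sentences = [['hello', 'world'], ['hello', 'again']] * 1000
    print(scan_vocab_parallel(sentences, processes=2).most_common(2))

Pickling the chunks over to the workers is the main overhead here, so the chunk size matters a lot: too small and the inter-process communication cost dominates.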

I need this feature too. I have about 500 GB of training data in a directory, consisting of about 500 files. It would be helpful if multiple files could be scanned at the same time during the vocabulary-building step.

I have some Cython code from spaCy that is helpful here:

  • The spacy.strings.hash_string function uses Murmurhash to produce a 64-bit integer key
  • The preshed library has a very fast and memory-efficient 64-bit-to-64-bit hash table, implemented using open addressing and linear probing. The table is slightly faster than Google's dense_hash_map for my use cases.

The strategy is to have two tables: an int-to-int table that stores the counts and only knows the hashes of the strings, and a Python table mapping the hashes to the strings. The Python table is only updated when we cross the minimum frequency threshold.

I wrote this up very quickly, but so far the performance seems much, much faster than scan_vocab, with much better memory use. I'm up to 1.5b words after running for 10 minutes.

import io
import os

from preshed.counter import PreshCounter
from spacy.strings import hash_string


def iter_dir(directory):
    # Minimal stand-in (not part of the original snippet): yield the path of
    # every file in the directory.
    for filename in os.listdir(directory):
        yield os.path.join(directory, filename)


class Corpus(object):
    def __init__(self, directory, min_freq=10):
        self.directory = directory
        self.counts = PreshCounter()  # hash -> count (the int-to-int table)
        self.strings = {}             # hash -> word, only for frequent words
        self.min_freq = min_freq

    def count_doc(self, words):
        # Get counts for this document
        doc_counts = PreshCounter()
        doc_strings = {}
        for word in words:
            key = hash_string(word)
            doc_counts.inc(key, 1)
            doc_strings[key] = word

        n = 0
        for key, count in doc_counts:
            self.counts.inc(key, count)
            # TODO: Why doesn't inc return this? =/
            corpus_count = self.counts[key]
            # Remember the string once the corpus-wide count crosses min_freq
            if corpus_count >= self.min_freq and (corpus_count - count) < self.min_freq:
                self.strings[key] = doc_strings[key]
            n += count
        return n

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with io.open(text_loc, 'r', encoding='utf8') as file_:
                for sent_str in file_:
                    yield sent_str.split()
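
For reference, a quick usage sketch (the directory path is a placeholder): make one pass over the corpus, feeding each sentence to count_doc; afterwards corpus.strings contains exactly the words whose corpus-wide count reached min_freq.

corpus = Corpus('/path/to/texts', min_freq=10)
total = 0
for sentence in corpus:
    total += corpus.count_doc(sentence)
print(total, len(corpus.strings))  # total token count, number of frequent word types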

Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning (removing words with a count of 1). If the number of unique tokens grows too fast, the dictionary gets pruned on nearly every document. To overcome this, it's better to remove all numbers, URLs and other special tokens from the corpus, e.g. by replacing them with placeholders like NUMBER, DATE, etc. Having preprocessed Russian Wikipedia like this, I got the dictionary built in 5 minutes (1.8b words, ~6 GB of preprocessed text).

I would suggest mentioning this in the documentation explicitly.

@windj007 Do you happen to have example code for replacing tokens with placeholders? I assume it's some simple regexes. It would be great to add it to the documentation.

For Wikipedia, a regex like this one worked for me:

BAD_TOKENS_RE = re.compile(r'''^(https?://.*|[()',";?=\.#=:0-9/\-%«»*—|]+|.*\.(jpg|png|gif|svg)|.)$''')

It matches tokens such as URLs, special symbols, and filenames.

When I wrote my previous comment, I had a set of regexes and actually replaced tokens with placeholders. Since then, I've merged all the regexes into this one and simply drop the matched tokens.
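
A small usage sketch (the sentence source is whatever iterable you already feed to word2vec): drop matching tokens before the vocab scan so they never inflate the dictionary.

import re

# The regex from the comment above.
BAD_TOKENS_RE = re.compile(r'''^(https?://.*|[()',";?=\.#=:0-9/\-%«»*—|]+|.*\.(jpg|png|gif|svg)|.)$''')

def clean_sentences(sentences):
    # Filter out URLs, runs of punctuation and image filenames so they never
    # reach the vocabulary (and never trigger pruning).
    for sentence in sentences:
        yield [token for token in sentence if not BAD_TOKENS_RE.match(token)]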

> Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning (removing words with a count of 1). If the number of unique tokens grows too fast, the dictionary gets pruned on nearly every document. To overcome this, it's better to remove all numbers, URLs and other special tokens from the corpus, e.g. by replacing them with placeholders like NUMBER, DATE, etc. Having preprocessed Russian Wikipedia like this, I got the dictionary built in 5 minutes (1.8b words, ~6 GB of preprocessed text).
>
> I would suggest mentioning this in the documentation explicitly.

Have you used the code above?
