What are you trying to achieve?
I was trying to train a doc2vec model on a corpus of 10M (10 million) documents, each roughly ~5,000 words long on average. The idea was to build a semantic search index over these documents using the doc2vec model.
What is the expected result?
I was expecting it to complete successfully, as it did when I tested on a smaller dataset. On a smaller dataset of 100K documents it worked fine, and I was able to do basic benchmarking of the search index, which passed my criteria.
What are you seeing instead?
When I started training on the 10M dataset, the training of the doc2vec model stopped right after building the vocabulary and resulted in a segmentation fault.
Include full tracebacks, logs, and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").
Here is the link to an example that reproduces it. In addition to the libraries listed below, it uses the following (unfortunately I could not set up a virtual env due to some issues):
RandomWords
Attached is the logging file.
logging_progress.log
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0]
NumPy 1.19.0
SciPy 1.5.1
gensim 3.8.3
FAST_VERSION 1
Here is the Google group thread with a detailed discussion.
To pin down the exact number above which it always segfaults and below which it always works, I am running some additional experiments. For now, it works fine for 5M documents and gives the expected results.
Thanks for the effort to create a reproducible example with random data! But, what is the significance of list_of_lengths.pickle? Can the code be updated to not use an opaque pickled object for necessary length parameters?
I tried my best to reproduce the exact scenario that I had. The list_of_lengths.pickle contains the lengths of the documents in my corpus, although we could eliminate it by generating lengths with Python's random module. What do you suggest? I'll update the code as needed.
If it's a simple list of int document lengths, a file with one number per line should work as well. (And, if simply making every doc the same length works to reproduce, that'd be just as good.) This data, even with the RandomWords as texts, creates the crash for you?
a file with one number per line should work as well
I'll convert the pickle to a text file with one number per line.
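For reference, a minimal sketch of that conversion (the file names and helper here are assumptions based on the repo description, not part of the actual script):

```python
import pickle

def pickle_lengths_to_text(pickle_path, txt_path):
    """Convert a pickled list of int document lengths into a plain
    text file with one number per line (hypothetical helper)."""
    with open(pickle_path, 'rb') as f:
        lengths = pickle.load(f)
    with open(txt_path, 'w') as f:
        for n in lengths:
            f.write('%d\n' % n)
    return len(lengths)
```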
This data, even with the RandomWords as texts, creates the crash for you?
Yes, it creates the segmentation fault.
if simply making every doc the same length works to reproduce, that'd be just as good
Will check this as well.
Another fairly quick test worth running: for a corpus_file.txt that triggers the problem, if you try the more generic method of specifying a corpus – a Python iterable, such as one that's created by the LineSentence utility class on that same corpus_file.txt – does it crash at the same place? (It might be far slower that way, especially w/ 30 workers – but if it only crashes with the corpus_file-style specification it points at different code paths, perhaps with unwise implementation limits, as the culprit.)
Updated the repository for the pickle to a text file.
a file with one number per line should work as well
I'll convert the pickle to a text file with one number per line.
Thanks! These are word counts for the docs, right? I see that of 10,000,000 docs, about 781K are over 10,000 words. While this should be accepted OK by gensim (& certainly not create a crash), just FYI: there is an internal implementation limit where words past the 10,000th of a text are silently ignored. In order for longer docs to be considered by Doc2Vec, they'd need to be broken into 10K-long docs which then share the same document tag (which actually isn't yet possible in the corpus_file mode).
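The splitting described above could be sketched roughly as follows, for the iterable-corpus mode (`chunk_document` is a hypothetical helper, not a gensim API; each returned (words, tags) pair would be wrapped in a `TaggedDocument` before training):

```python
def chunk_document(tokens, tag, max_words=10000):
    """Split one long token list into pieces of at most max_words tokens
    that all share the same document tag, so text past gensim's internal
    10K-token-per-text limit can still contribute to one doc-vector."""
    return [(tokens[i:i + max_words], [tag])
            for i in range(0, len(tokens), max_words)]
```

Note this tag-sharing only works with the `documents=` iterable mode; as stated above, it isn't yet possible in `corpus_file` mode, where each line is its own document.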
Thanks for letting me know about some of the internal details of Doc2Vec. I have just checked the 100 longest documents in my corpus; they contain 200K+ tokens, and some contain as many as 3M tokens. Could that be a problem? Below is a table of the lengths of the 50 longest documents.
| Length of longest documents (tokens) |
|-----|
| 316691 |
| 316703 |
| 316742 |
| 316773 |
| 316783 |
| 316797 |
| 316817 |
| 316823 |
| 316850 |
| 316865 |
| 316865 |
| 316929 |
| 316929 |
| 316929 |
| 317139 |
| 317195 |
| 317195 |
| 317307 |
| 317733 |
| 318162 |
| 344057 |
| 351887 |
| 356643 |
| 356643 |
| 363108 |
| 363271 |
| 363271 |
| 363271 |
| 373338 |
| 385525 |
| 388338 |
| 388382 |
| 388594 |
| 388732 |
| 388923 |
| 397950 |
| 448824 |
| 448824 |
| 450107 |
| 455986 |
| 467019 |
| 485819 |
| 535723 |
| 652092 |
| 659184 |
| 659184 |
| 659184 |
| 749399 |
| 2535337 |
| 3523532 |
It shouldn't cause a crash; it will mean only the 1st 10K tokens of those docs will be used for training. (It might be a factor involved in the crash, I'm not sure.)
When you run this code, how big is the corpus_file.txt created by the 1st 15 lines of your reproduce_error.py script? (My rough estimate is somewhere above 300GB.) How big is your true-data corpus_file.txt?
When you report that a 5M-line variant works, is that with half your real data, or half the RandomWords data, or both? (In any 5M line cases you've run, how large are the corpus_file.txt files involved?)
Also, note: if you can .save() the model after the .build_vocab() step, then it may be possible to just use a .load() to restore the model to the state right before a .train() triggers the fault - much quicker than repeating the vocab-scan each time. See the PR I made vs your error repo for a quickie (untested) example of roughly what I mean.
With such a quick-reproduce recipe, we would then want to try:
(1) the non-corpus-file path. Instead of:
model.train(corpus_file=corpus_path, total_examples=model.corpus_count,total_words=model.corpus_total_words, epochs=model.epochs)
...try...
from gensim.models.doc2vec import TaggedLineDocument
model.train(documents=TaggedLineDocument(corpus_path), total_examples=model.corpus_count,total_words=model.corpus_total_words, epochs=model.epochs)
If this starts training without the same instant fault in the corpus_file= variant, as I suspect it will, we'll know the problem is specific to the corpus_file code. (No need to let the training complete.)
(2) getting a core-dump of the crash, & opening it in gdb to view the call stack(s) at the moment of the fault, which might also closely identify whatever bug/limit is being mishandled. (The essence is: (a) ensure the environment is set to dump a core in case of segfault, usually via a ulimit adjustment; (b) run gdb -c COREFILENAME; (c) run gdb commands to inspect state, most importantly for our purposes: thread apply all bt (show traces of all threads).)
When you run this code, how big is the `corpus_file.txt` created by the 1st 15 lines of your `reproduce_error.py` script? (My rough estimate is somewhere above 300GB.) How big is your true-data `corpus_file.txt`?
It's 315GB in size.
When you report that a 5M-line variant works, is that with half your real data, or half the `RandomWords` data, or both? (In any 5M-line cases you've run, how large are the `corpus_file.txt` files involved?)
It is half of my real data. The exact size of the corpus file is 157GB.
Also, note: if you can `.save()` the model after the `.build_vocab()` step, then it may be possible to just use a `.load()` to restore the model to the state right before a `.train()` triggers the fault - much quicker than repeating the vocab-scan each time. See the PR I made vs your error repo for a quickie (untested) example of roughly what I mean.

With such a quick-reproduce recipe, we would then want to try:

(1) the non-corpus-file path. Instead of:

`model.train(corpus_file=corpus_path, total_examples=model.corpus_count, total_words=model.corpus_total_words, epochs=model.epochs)`

...try...

`from gensim.models.doc2vec import TaggedLineDocument`
`model.train(documents=TaggedLineDocument(corpus_path), total_examples=model.corpus_count, total_words=model.corpus_total_words, epochs=model.epochs)`

If this starts training without the same instant fault as in the `corpus_file=` variant, as I suspect it will, we'll know the problem is specific to the `corpus_file` code. (No need to let the training complete.)

(2) getting a core-dump of the crash, & opening it in `gdb` to view the call stack(s) at the moment of the fault, which might also closely identify whatever bug/limit is being mishandled. (The essence is: (a) ensure the environment is set to dump a core in case of segfault, usually via a `ulimit` adjustment; (b) run `gdb -c COREFILENAME`; (c) run gdb commands to inspect state, most importantly for our purposes: `thread apply all bt` (show traces of all threads).)
Thanks for that! I am currently working on finding the exact number above which we always get a segmentation fault and below which we always get the training successful. I am pretty close to the number (using binary search).
I'll give the other options, like TaggedLineDocument, a separate look.
I have finally found the exact number above which the doc2vec model always gives a segmentation fault, and at or below which it always starts training (although I did not let the training process complete). 7158293 is the number at and below which the doc2vec model starts training successfully, whereas increasing it even by one gives the segmentation fault. I used the synthetic dataset with documents of length 100 tokens only, to speed up the process.
Can you post the full log (at least INFO level) from that run?
Good to hear of your progress, & that's a major clue, as 7158293 * 300 dimensions = 2,147,487,900, suspiciously close to 2^31 (2,147,483,648).
That's strongly suggestive that the problem is some misuse of a signed 32-bit int where a wider int type should be used, and indexing overflow is causing the crash.
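That arithmetic can be sanity-checked in pure Python (the precise expression that overflows inside the compiled code is an assumption; this just simulates a C signed 32-bit wrap-around):

```python
def to_int32(n):
    """Simulate two's-complement wrap-around of a C signed 32-bit int."""
    return ((n + 2**31) % 2**32) - 2**31

VECTOR_SIZE = 300
product = 7158293 * VECTOR_SIZE  # 2,147,487,900 -- just past 2^31
print(product > 2**31)           # the product no longer fits in a signed int32
print(to_int32(product))         # wraps to a large negative value
```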
Have you been able to verify my theory that training would get past that quick crash if the .train() is called with a TaggedLineDocument corpus passed as documents instead of a filename passed as corpus_file? (This could re-use a model on the crashing file that was .save()d after .build_vocab(), if you happen to have created one.)
(Another too-narrow-type problem, though one that's only caused missed training & not a segfault, is #2679. All the cython code should get a scan for potential use of signed/unsigned 32-bit ints where 64 bits would be required for the very-large, >2GB/4GB array-indexing that's increasingly common in *2Vec models.)
Thanks. I wanted to eyeball the log in order to spot any suspicious numbers (signs of overflow), but @gojomo's observation above is already a good smoking gun.
If you want this resolved quickly the best option might be to check for potential int32-vs-int64 variable problems yourself. It shouldn't be too hard, the file is here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec_corpusfile.pyx (look for ints where there should be long long).
Should I make a PR after updating the code?
Of course :)
My PR failed the CircleCI checks; could you take a look at it here and let me know what I'm doing wrong? I wasn't able to install the test dependencies and wasn't able to run the tox * commands.
Good to hear of your progress, & that's a major clue, as 7158293 * 300 dimensions = 2,147,487,900, suspiciously close to 2^31 (2,147,483,648).
That's strongly suggestive that the problem is some misuse of a signed 32-bit int where a wider int type should be used, and indexing overflow is causing the crash.
Have you been able to verify my theory that training would get past that quick crash if the `.train()` is called with a `TaggedLineDocument` corpus passed as `documents` instead of a filename passed as `corpus_file`? (This could re-use a model on the crashing file that was `.save()`d after `.build_vocab()`, if you happen to have created one.)

(Another too-narrow-type problem, though one that's only caused missed training & not a segfault, is #2679. All the cython code should get a scan for potential use of signed/unsigned 32-bit ints where 64 bits would be required for the very-large, >2GB/4GB array-indexing that's increasingly common in \*2Vec models.)
Using TaggedLineDocument did not trigger any error on the larger dataset! I did not let the training job complete, though.
If you want this resolved quickly the best option might be to check for potential int32-vs-int64 variable problems yourself. It shouldn't be too hard, the file is here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec_corpusfile.pyx (look for `int`s where there should be `long long`).
But note:

- the problematic declarations might instead be in related files like `word2vec_inner.pyx` or `doc2vec_inner.pyx` or `word2vec_corpusfile.pyx`, or in definitions in any of the matching `.pxd` files;
- the culprit type might be something more specific like `np.uint_32` rather than just `int` (even though I'm not sure `np.uint_32` itself, which should go to 2^32, could be the problem), or be some other alias obscuring the raw types involved, or be hidden in some other library function used incorrectly;
- many `int` usages there are probably OK, especially if in pure Python/Cython code, and a simple replace of all `int` refs with `long long` refs (as in your PR) is likely to break other things.

If you're able to do the gdb-related steps – saving a core from an actual fault, and using gdb to show the backtraces of all threads at the moment of the error – that may point more directly to a single function, or few lines, where the wrong types are being used, or where an oversized Python int is shoved into a narrower type where it overflows. And when trying fixes, or adding debugging code, you'll need to be able to set up your whole local install for rebuilding the compiled shared libraries – which will require C/C++ build tools, the cython package, & an explicit build step after editing the files – in order to run the code locally (either via the whole unit-test suite or just via your custom trigger code).
It's just a quick run of gdb; the result for the segmentation fault is shown in the image below.

Let me know if it's helpful, or if there are any further instructions.
The textual output of the gdb command thread apply all bt will be most informative. (After that, if the offending thread is obvious, some more inspection of its local variables, in the crashing frame and perhaps a few frames up, may also be helpful... but the bt backtraces first.)
(gdb) bt
#0  0x00007fff7880bb30 in saxpy_kernel_16 ()
   from /home/mohsin/.local/lib/python3.6/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#1  0x00007fff7880bd4f in saxpy_k_NEHALEM ()
   from /home/mohsin/.local/lib/python3.6/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#2  0x00007fff783042cb in saxpy_ ()
   from /home/mohsin/.local/lib/python3.6/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#3  0x00007fff1f773912 in ?? ()
   from /home/mohsin/.local/lib/python3.6/site-packages/gensim/models/doc2vec_corpusfile.cpython-36m-x86_64-linux-gnu.so
#4  0x00007fff1f77459f in ?? ()
   from /home/mohsin/.local/lib/python3.6/site-packages/gensim/models/doc2vec_corpusfile.cpython-36m-x86_64-linux-gnu.so
#5  0x000000000050ac25 in ?? ()
#6  0x000000000050d390 in _PyEval_EvalFrameDefault ()
#7  0x0000000000508245 in ?? ()
#8  0x0000000000509642 in _PyFunction_FastCallDict ()
#9  0x0000000000595311 in ?? ()
#10 0x00000000005a067e in PyObject_Call ()
#11 0x000000000050d966 in _PyEval_EvalFrameDefault ()
#12 0x0000000000508245 in ?? ()
#13 0x0000000000509642 in _PyFunction_FastCallDict ()
#14 0x0000000000595311 in ?? ()
#15 0x00000000005a067e in PyObject_Call ()
#16 0x000000000050d966 in _PyEval_EvalFrameDefault ()
#17 0x0000000000509d48 in ?? ()
#18 0x000000000050aa7d in ?? ()
#19 0x000000000050c5b9 in _PyEval_EvalFrameDefault ()
#20 0x0000000000509d48 in ?? ()
#21 0x000000000050aa7d in ?? ()
#22 0x000000000050c5b9 in _PyEval_EvalFrameDefault ()
#23 0x0000000000509455 in _PyFunction_FastCallDict ()
#24 0x0000000000595311 in ?? ()
#25 0x00000000005a067e in PyObject_Call ()
#26 0x00000000005e1b72 in ?? ()
#27 0x0000000000631f44 in ?? ()
#28 0x00007ffff77cc6db in start_thread (arg=0x7ffc97fff700) at pthread_create.c:463
#29 0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
The full log file for `thread apply all bt` is given below.
gdb.txt
thread apply all bt is the command needed to get all the thread backtraces. (But, from your one thread, it looks like important symbols might be missing, so this may not be as helpful as I'd hoped until/unless that's fixed, and I don't know the fix offhand.)
Just updated my comment above
When will this issue be resolved in gensim?
Re: the backtrace(s):
I think the one thread backtrace you've highlighted should be the one where the segfault occurred, though I believe I've occasionally seen cases where the 'current thread' in a core is something else. (And often, the thread/frame that "steps on" some misguided data isn't the one that caused the problem, via some more subtle error arbitrarily earlier.)
Having symbols & line-numbers in the trace would make it more useful, but I'm not sure what (probably minor) steps you'd have to take to get those. (It might be enabled via installing some gdb extra, or using the cygdb gdb-wrapper that comes with cython, or maybe even just using a py-bt command instead of bt.)
However, looking at just the filenames, it seems the segfault actually occurs inside scipy code, which the code in doc2vec_corpusfile (in frame #3) likely calls with unwise/already-corrupted parameters. (Once we have symbols, debugger-inspecting that frame, and/or adding extra sanity-checking/logging around the offending line, will likely provide the next key clue to the real problem.)
Re: when fixed?
You've provided a clear map to reproducing & what's likely involved (a signed 32-bit int overflow), and I suspect the recipe to trigger can be made even smaller/faster. (EG: instead of 7.2M 300D vectors w/ 100-word training docs, 400K 6000D vectors w/ 1-word training docs is likely to trigger the same overflow - so no 300GB+ test file or long slow vocab-scan necessary.) That will make it easier for myself or others to investigate further. But I'm not sure when there'll be time for that, or it will succeed in finding a fix, or when an official release with the fix will happen.
In the meantime, workarounds could include: (1) using the non-corpus_file method of specifying the corpus - which, if you were successfully using ~30 threads, may slow your training by a factor of 3 or more, but should complete without segfault. (2) training only on some representative subset of docs, and/or with lower dimensions, making sure doc_count * vector_size < 2^31 - but then inferring vectors, outside of training, for any excess documents.
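For workaround (2), the bound can be computed directly (a conservative sketch; `max_trainable_docs` is a hypothetical helper, and the empirically observed crash threshold above was slightly higher than this bound):

```python
def max_trainable_docs(vector_size, limit=2**31):
    """Largest doc count keeping doc_count * vector_size strictly
    below 2^31, to stay clear of the suspected 32-bit overflow."""
    return (limit - 1) // vector_size

print(max_trainable_docs(300))   # ~7.16M docs at 300 dimensions
print(max_trainable_docs(6000))  # far fewer docs at higher dimensionality
```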
I wrote:
You've provided a clear map to reproducing & what's likely involved (a signed 32-bit int overflow), and I suspect the recipe to trigger can be made even smaller/faster. (EG: instead of 7.2M 300D vectors w/ 100-word training docs, 400K 6000D vectors w/ 1-word training docs is likely to trigger the same overflow - so no 300GB+ test file or long slow vocab-scan necessary.)
Confirmed that with a test file created by...
with open('400klines', 'w') as f:
    for _ in range(400000):
        f.write('a\n')
...the following is enough to trigger a fault...
model = Doc2Vec(corpus_file='400klines', min_count=1, vector_size=6000)
Further, it may be sufficient to change the 3 lines in doc2vec_corpusfile.pyx that read...
cdef int _doc_tag = start_doctag
...to...
cdef long long _doc_tag = start_doctag
That avoids the crash in the tiny test case above.
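The mechanics of why that widening helps can be simulated in Python (an illustrative sketch, not the actual Cython code: the assumption is that `_doc_tag` gets multiplied by the vector size to index a flat vector array, and that product is computed in the type of `_doc_tag`):

```python
def doc_vector_offset(doc_tag, vector_size, wide):
    """Flat-array offset of a doc-vector, either in 64-bit ('long long')
    math or with a simulated 32-bit ('int') wrap-around."""
    offset = doc_tag * vector_size
    if wide:
        return offset
    return ((offset + 2**31) % 2**32) - 2**31  # C signed 32-bit wrap

# Near the reported threshold: 64-bit math stays correct,
# while 32-bit math wraps to a negative (out-of-bounds) index.
print(doc_vector_offset(7158294, 300, wide=True))
print(doc_vector_offset(7158294, 300, wide=False))
```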