hey,
I feel something is wrong with my text data. I used Common Crawl text data, and even without any preprocessing (to rule that out as an error source), the fastText model stops during training and just exits without any message. That happens only for some ngram ranges, like:
model=FastText(sentences,min_n=4)
print(model)
In this case no model summary is printed and the model is not built, but there is no message at all.
Passing min_n=2, max_n=4 instead, the model is built and works fine.
I used my text data and a sentence iterator like this one
>>> from gensim.utils import tokenize
>>> from gensim.test.utils import datapath
>>> import smart_open
>>>
>>>
>>> class MyIter(object):
...     def __iter__(self):
...         path = datapath('crime-and-punishment.txt')
...         with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
...             for line in fin:
...                 yield list(tokenize(line))
together with the FastText call above.
Might this come from bad encoding, like \xad? How can I find out what is wrong in my data?
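One way to check whether characters like the soft hyphen (\xad) are lurking in a corpus is to scan it for Unicode control/format characters. A minimal sketch (the helper name is made up, not a gensim API):

```python
import collections
import unicodedata

def suspicious_chars(lines):
    """Count control/format characters (Unicode categories Cc/Cf),
    e.g. the soft hyphen U+00AD, while ignoring ordinary whitespace."""
    counts = collections.Counter()
    for line in lines:
        for ch in line:
            if unicodedata.category(ch) in ("Cc", "Cf") and ch not in "\n\r\t":
                counts[ch] += 1
    return counts

# Example: one soft hyphen hiding inside a word
print(suspicious_chars(["soft\u00adhyphen", "plain text"]))
```

Running this over the raw lines of the corpus (before tokenization) shows which invisible characters are present and how often.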
Thank you for the report. Sounds like a bug as opposed to a data problem.
In this case no model summary is printed and the model was not built, but there is no message at all.
Not seeing ANY output is very odd. Did the Python subprocess segfault? Was there a core dump or a non-zero exit code?
We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?
Can you truncate the dataset and still reproduce the problem? If yes, please post your truncated data here.
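Truncating can be done without editing the file, e.g. with `itertools.islice` over the sentence iterator. A generic sketch (not gensim-specific) for bisecting down to a minimal reproducer:

```python
from itertools import islice

def head(sentences, n):
    """Return only the first n sentences, to help bisect a minimal repro."""
    return list(islice(sentences, n))

# Stand-in corpus: 100 tokenized "sentences"
corpus = (["token", str(i)] for i in range(100))
subset = head(corpus, 10)
print(len(subset))  # 10
```

If the bug still reproduces with `head(MyIter(), 10)`, keep halving `n` until you have the smallest failing subset to post.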
Also, what OS and Python version are you using?
Thank you @mpenkov !
Did the Python subprocess segfault? Was there a core dump or non-zero exit code?
If that were the case, would it be shown in the console? Because there really was no message; it just quits.
We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?
That sounds good. I will try this version!
I use Windows 10 (4-core 64-bit i7) with Python 3.6.
I'd rather not post my data, but I will test with an open data set. Could I perhaps send the data to you privately?
If this would be the case is this shown inside the console?
I'm not familiar with Windows, so I can't answer that, sorry.
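One portable option on any platform: Python's built-in `faulthandler` module makes the interpreter dump the Python traceback to stderr when it receives a fatal signal such as a segfault, so even a "silent" crash leaves a trace. A minimal sketch:

```python
import faulthandler

# Enable as early as possible, before the crashing code runs.
faulthandler.enable()
print(faulthandler.is_enabled())  # True once enabled

# ... run the training code here; on a segfault, the Python-level
# traceback is written to stderr before the process dies.
```

Alternatively, running the script as `python -X faulthandler script.py` enables it without changing the code.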
Where can I post the data to you personally maybe?
Please try reducing the data to a smaller subset first. It's highly likely that you can reproduce the bug with a few sentences as opposed to an entire corpus.
@ctrado18 are you the same person as on the mailing list?
Hey guys,
I tested with v3.6. Still the same. I also used the test data https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/alldata-id-10.txt
I removed my preprocessor to exclude it as an error source, but the result is still the same.
So it is not a problem with my data, iterator, or preprocessor. Does that point to some deeper issue with my hardware? But my machine is quite new.
So, what can I do to find out what is going on here?
Here is my code together with above test data:
class MyIter(object):
    def __iter__(self):
        path = datapath('dat.txt')
        with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
            for line in fin:
                s = sentence_detector.tokenize(line)
                for k in s:
                    if k:
                        yield list(tokenize(k))

model = FastText(sentences=MyIter(), sg=1, hs=1, min_n=4, max_n=6)
print(model)
It just quits without printing anything. With min_n=2 and max_n=4 it works! Strange.
Also, leaving out sg=1 and hs=1 and using just:
model = FastText(sentences=MyIter(), min_n=4, max_n=6)
print(model)
it works. So I think it is also related to skip-gram mode?
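To narrow down exactly which parameter combination triggers the crash, each configuration can be run in a child process, so a segfault in one run does not kill the whole sweep. A rough sketch; the child script body is a placeholder, and the actual FastText call would go where the comment is:

```python
import subprocess
import sys

# Placeholder child script: parse the parameters, then (in real use)
# run the FastText training with them.
CHILD = """\
import sys
sg, hs, min_n, max_n = map(int, sys.argv[1:])
# e.g. FastText(sentences=..., sg=sg, hs=hs, min_n=min_n, max_n=max_n)
"""

def exit_code(sg, hs, min_n, max_n):
    """Run one configuration in a subprocess. A negative return code on
    Unix means the child died from a signal (-11 == SIGSEGV)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD,
         str(sg), str(hs), str(min_n), str(max_n)])
    return proc.returncode

for params in [(0, 0, 4, 6), (1, 1, 4, 6)]:
    print(params, exit_code(*params))
```

With the real training call in place, any configuration that prints a negative code (or a crash code on Windows) is a segfaulting combination, and the sweep keeps going regardless.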
@ctrado18 are you the same person as on the mailing list?
I am sorry for dragging out the discussion, but I felt misunderstood at first because the problem seemed clear to me. I hope everything is clear now.
Thank you for providing detailed information. I've reproduced the bug:
bug.py:
from gensim.models.fasttext import FastText
from gensim.utils import tokenize
from gensim.test.utils import datapath
import smart_open
import logging
logging.basicConfig(level=logging.INFO)
path = datapath('alldata-id-10.txt')
with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
sentences = [list(tokenize(l)) for l in fin]
model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6)
print(model)
reproduced example:
(gensim) misha@cabron:~/git/gensim$ time python bug.py
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
Segmentation fault (core dumped)
real 0m9.275s
user 0m7.132s
sys 0m2.907s
gdb session:
(gdb) r bug.py
Starting program: /home/misha/envs/gensim/bin/python bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff1eb6700 (LWP 10740)]
[New Thread 0x7ffff16b5700 (LWP 10741)]
[New Thread 0x7fffeeeb4700 (LWP 10742)]
[New Thread 0x7fffea6b3700 (LWP 10743)]
[New Thread 0x7fffe7eb2700 (LWP 10744)]
[New Thread 0x7fffe56b1700 (LWP 10745)]
[New Thread 0x7fffe4eb0700 (LWP 10746)]
[Thread 0x7fffe7eb2700 (LWP 10744) exited]
[Thread 0x7fffe4eb0700 (LWP 10746) exited]
[Thread 0x7fffe56b1700 (LWP 10745) exited]
[Thread 0x7fffea6b3700 (LWP 10743) exited]
[Thread 0x7fffeeeb4700 (LWP 10742) exited]
[Thread 0x7ffff16b5700 (LWP 10741) exited]
[Thread 0x7ffff1eb6700 (LWP 10740) exited]
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
[New Thread 0x7fffe4eb0700 (LWP 11098)]
[New Thread 0x7fffe56b1700 (LWP 11099)]
[New Thread 0x7fffe7eb2700 (LWP 11100)]
[New Thread 0x7fffea6b3700 (LWP 11101)]
[Thread 0x7fffea6b3700 (LWP 11101) exited]
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
[Thread 0x7fffe4eb0700 (LWP 11098) exited]
Thread 10 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe56b1700 (LWP 11099)]
__pyx_f_6gensim_6models_14fasttext_inner_fasttext_fast_sentence_sg_hs (__pyx_v_word_point=0xb53900, __pyx_v_word_code=0x121c3b0 "\001\001", __pyx_v_codelen=<optimized out>,
__pyx_v_syn0_vocab=<optimized out>, __pyx_v_syn0_ngrams=0x7ffee671c010, __pyx_v_syn1=0x1808b70, __pyx_v_size=<optimized out>, __pyx_v_word2_index=33, __pyx_v_subwords_index=0x1428670,
__pyx_v_subwords_len=0, __pyx_v_alpha=0.0250000004, __pyx_v_work=0x7fffc8001000, __pyx_v_l1=0x7fffc8001200, __pyx_v_word_locks_vocab=0x1810570, __pyx_v_word_locks_ngrams=0x7fff757ef010)
at ./gensim/models/fasttext_inner.c:2593
2593 __pyx_v_g = (((1 - (__pyx_v_word_code[__pyx_v_b])) - __pyx_v_f) * __pyx_v_alpha);
(gdb) p __pyx_v_word_code
$1 = (const __pyx_t_5numpy_uint8_t *) 0x121c3b0 "\001\001"
(gdb) p __pyx_v_f
Cannot access memory at address 0x7ffdd5dd87c0
(gdb) p __pyx_v_alpha
$2 = 0.0250000004
(gdb) p __pyx_v_b
$3 = 0
Looks like we're trying to access memory that we shouldn't be touching.
We'll need to debug the Cython code to work out what the problem is.
@mpenkov WOW. I am so happy. Thanks! That is really great of you! I had so many headaches because of this and went crazy, since it was such an obvious bug that I checked my whole text data sentence by sentence...
But why am I the first one to observe this? Are there really so few people working with this? Since it is also tied to sg=1, it seems no one uses that mode...
That is bad, because all the research I can do is limited to the ngram range 2-4, which is a bit too small for my case. So I should have used logging to catch the segfault. What is gdb?
Anyway, thanks! 😄
I look forward to the solution! 😄