hey,
I feel something is wrong with my text data. I used Common Crawl text data, and even without any preprocessing (to rule that out as an error source), the fastText model stops during training and just exits without any message. That happens only for some ngram ranges, like:
model=FastText(sentences,min_n=4)
print(model)
In this case no model summary is printed and the model is not built, but there is no message at all.
Passing min_n=2, max_n=4 instead, the model is built and works fine.
I used my text data and a sentence iterator like this one
>>> from gensim.utils import tokenize
>>> from gensim.test.utils import datapath
>>> import smart_open
>>>
>>>
>>> class MyIter(object):
...     def __iter__(self):
...         path = datapath('crime-and-punishment.txt')
...         with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
...             for line in fin:
...                 yield list(tokenize(line))
together with the FastText call above.
Might this come from bad encoding, like \xad? How can I find out what is wrong in my data?
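One way to check whether characters like the soft hyphen (\xad) are lurking in a corpus is to scan it for Unicode control/format characters. A minimal sketch (the helper name is made up, not a gensim API):

```python
import collections
import unicodedata

def suspicious_chars(lines):
    """Count control/format characters (Unicode categories Cc/Cf),
    e.g. the soft hyphen U+00AD, while ignoring ordinary whitespace."""
    counts = collections.Counter()
    for line in lines:
        for ch in line:
            if unicodedata.category(ch) in ("Cc", "Cf") and ch not in "\n\r\t":
                counts[ch] += 1
    return counts

# Example: one soft hyphen hiding inside a word
print(suspicious_chars(["soft\u00adhyphen", "plain text"]))
```

Running this over the raw lines of the corpus (before tokenization) shows which invisible characters are present and how often.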
Thank you for the report. Sounds like a bug as opposed to a data problem.
In this case no model summary is printed and the model was not built, but there is no message at all.
Not seeing ANY output is very odd. Did the Python subprocess segfault? Was there a core dump or a non-zero exit code?
We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?
Can you truncate the dataset and still reproduce the problem? If yes, please post your truncated data here.
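Truncating can be done without editing the file, e.g. with `itertools.islice` over the sentence iterator. A generic sketch (not gensim-specific) for bisecting down to a minimal reproducer:

```python
from itertools import islice

def head(sentences, n):
    """Return only the first n sentences, to help bisect a minimal repro."""
    return list(islice(sentences, n))

# Stand-in corpus: 100 tokenized "sentences"
corpus = (["token", str(i)] for i in range(100))
subset = head(corpus, 10)
print(len(subset))  # 10
```

If the bug still reproduces with `head(MyIter(), 10)`, keep halving `n` until you have the smallest failing subset to post.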
Also, what OS and Python version are you using?
Thank you @mpenkov !
Did the Python subprocess segfault? Was there a core dump or non-zero exit code?
If that were the case, would it be shown in the console? Because there really was no message; it just quits.
We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?
That sounds good. I will try this version!
I use Windows 10 (4-core 64-bit i7) with Python 3.6.
I'd rather not post my data, but I will test with an open data set. Could I perhaps send the data to you privately?
If this would be the case is this shown inside the console?
I'm not familiar with Windows, so I can't answer that, sorry.
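One portable option on any platform: Python's built-in `faulthandler` module makes the interpreter dump the Python traceback to stderr when it receives a fatal signal such as a segfault, so even a "silent" crash leaves a trace. A minimal sketch:

```python
import faulthandler

# Enable as early as possible, before the crashing code runs.
faulthandler.enable()
print(faulthandler.is_enabled())  # True once enabled

# ... run the training code here; on a segfault, the Python-level
# traceback is written to stderr before the process dies.
```

Alternatively, running the script as `python -X faulthandler script.py` enables it without changing the code.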
Where can I post the data to you personally maybe?
Please try reducing the data to a smaller subset first. It's highly likely that you can reproduce the bug with a few sentences as opposed to an entire corpus.
@ctrado18 are you the same person as on the mailing list?
Hey guys,
I tested with v3.6. Still the same. I also used the test data https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/alldata-id-10.txt
I removed my preprocessor to exclude it as an error source, but the result is still the same.
So it is not a problem with my data, iterator, or preprocessor. Does that point to some deeper issue with my hardware? But my machine is quite new.
So, what can I do to find out what is going on here?
Here is my code together with above test data:
class MyIter(object):
    def __iter__(self):
        path = datapath('dat.txt')
        with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
            for line in fin:
                s = sentence_detector.tokenize(line)
                for k in s:
                    if k:
                        yield list(tokenize(k))

model = FastText(sentences=MyIter(), sg=1, hs=1, min_n=4, max_n=6)
print(model)
It just quits without printing anything. With min_n=2 and max_n=4 it works! Strange.
Also, leaving out sg=1 and hs=1 and using just:
model = FastText(sentences=MyIter(), min_n=4, max_n=6)
print(model)
it works. So I think it is also related to skip-gram mode?
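To narrow down exactly which parameter combination triggers the crash, each configuration can be run in a child process, so a segfault in one run does not kill the whole sweep. A rough sketch; the child script body is a placeholder, and the actual FastText call would go where the comment is:

```python
import subprocess
import sys

# Placeholder child script: parse the parameters, then (in real use)
# run the FastText training with them.
CHILD = """\
import sys
sg, hs, min_n, max_n = map(int, sys.argv[1:])
# e.g. FastText(sentences=..., sg=sg, hs=hs, min_n=min_n, max_n=max_n)
"""

def exit_code(sg, hs, min_n, max_n):
    """Run one configuration in a subprocess. A negative return code on
    Unix means the child died from a signal (-11 == SIGSEGV)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD,
         str(sg), str(hs), str(min_n), str(max_n)])
    return proc.returncode

for params in [(0, 0, 4, 6), (1, 1, 4, 6)]:
    print(params, exit_code(*params))
```

With the real training call in place, any configuration that prints a negative code (or a crash code on Windows) is a segfaulting combination, and the sweep keeps going regardless.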
@ctrado18 are you the same person as on the mailing list?
I am sorry for dragging out the discussion, but I felt misunderstood at first because the problem seemed clear to me. I hope everything is clear now.
Thank you for providing detailed information. I've reproduced the bug:
bug.py:
from gensim.models.fasttext import FastText
from gensim.utils import tokenize
from gensim.test.utils import datapath
import smart_open
import logging
logging.basicConfig(level=logging.INFO)
path = datapath('alldata-id-10.txt')
with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
sentences = [list(tokenize(l)) for l in fin]
model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6)
print(model)
reproduced example:
(gensim) misha@cabron:~/git/gensim$ time python bug.py
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
Segmentation fault (core dumped)
real 0m9.275s
user 0m7.132s
sys 0m2.907s
gdb session:
(gdb) r bug.py
Starting program: /home/misha/envs/gensim/bin/python bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff1eb6700 (LWP 10740)]
[New Thread 0x7ffff16b5700 (LWP 10741)]
[New Thread 0x7fffeeeb4700 (LWP 10742)]
[New Thread 0x7fffea6b3700 (LWP 10743)]
[New Thread 0x7fffe7eb2700 (LWP 10744)]
[New Thread 0x7fffe56b1700 (LWP 10745)]
[New Thread 0x7fffe4eb0700 (LWP 10746)]
[Thread 0x7fffe7eb2700 (LWP 10744) exited]
[Thread 0x7fffe4eb0700 (LWP 10746) exited]
[Thread 0x7fffe56b1700 (LWP 10745) exited]
[Thread 0x7fffea6b3700 (LWP 10743) exited]
[Thread 0x7fffeeeb4700 (LWP 10742) exited]
[Thread 0x7ffff16b5700 (LWP 10741) exited]
[Thread 0x7ffff1eb6700 (LWP 10740) exited]
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
[New Thread 0x7fffe4eb0700 (LWP 11098)]
[New Thread 0x7fffe56b1700 (LWP 11099)]
[New Thread 0x7fffe7eb2700 (LWP 11100)]
[New Thread 0x7fffea6b3700 (LWP 11101)]
[Thread 0x7fffea6b3700 (LWP 11101) exited]
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
[Thread 0x7fffe4eb0700 (LWP 11098) exited]
Thread 10 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe56b1700 (LWP 11099)]
__pyx_f_6gensim_6models_14fasttext_inner_fasttext_fast_sentence_sg_hs (__pyx_v_word_point=0xb53900, __pyx_v_word_code=0x121c3b0 "\001\001", __pyx_v_codelen=<optimized out>,
__pyx_v_syn0_vocab=<optimized out>, __pyx_v_syn0_ngrams=0x7ffee671c010, __pyx_v_syn1=0x1808b70, __pyx_v_size=<optimized out>, __pyx_v_word2_index=33, __pyx_v_subwords_index=0x1428670,
__pyx_v_subwords_len=0, __pyx_v_alpha=0.0250000004, __pyx_v_work=0x7fffc8001000, __pyx_v_l1=0x7fffc8001200, __pyx_v_word_locks_vocab=0x1810570, __pyx_v_word_locks_ngrams=0x7fff757ef010)
at ./gensim/models/fasttext_inner.c:2593
2593 __pyx_v_g = (((1 - (__pyx_v_word_code[__pyx_v_b])) - __pyx_v_f) * __pyx_v_alpha);
(gdb) p __pyx_v_word_code
$1 = (const __pyx_t_5numpy_uint8_t *) 0x121c3b0 "\001\001"
(gdb) p __pyx_v_f
Cannot access memory at address 0x7ffdd5dd87c0
(gdb) p __pyx_v_alpha
$2 = 0.0250000004
(gdb) p __pyx_v_b
$3 = 0
Looks like we're trying to access memory that we shouldn't be touching.
We'll need to debug the Cython code to work out what the problem is.
@mpenkov WOW. I am so happy. Thanks! That is really great of you! I had so many headaches because of this and went crazy, since it was such an obvious bug that I checked my whole text data sentence by sentence...
But why am I the first one to observe this? Are there really so few people working with this? Since it is also tied to sg=1, it seems no one uses that mode...
That is bad, because all the research I can do is limited to the ngram range 2-4, which is a bit too small for my case. So I should have used logging to catch the segfault. What is gdb?
Anyway, thanks! 😄
I look forward to the solution! 😄