Gensim: FastText SkipGram Implementation Broken since 3.7.2

Created on 29 May 2019 · 7 Comments · Source: RaRe-Technologies/gensim

The FastText implementation using skip-gram appears to be broken since 3.7.2. Below is the sample code I am using, which is almost identical to the example in the docs but with additional printed output. In v3.7.1, everything runs fine, but in subsequent versions, an IndexError occurs during train_sg_pair.

# Sample Code
import sys
import gensim
from gensim.models import FastText
from gensim.test.utils import common_texts
print(f"Python {sys.version.split()[0]} | Gensim {gensim.__version__}")

sim_word = "computer"

print("CBOW")
cbow = FastText(size=4, window=3, min_count=1)
cbow.build_vocab(sentences=common_texts)
cbow.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)
cbow_similarities = " | ".join(
    [f"{word}: {sim:0.4f}" for (word, sim) in cbow.most_similar("computer")]
)
print(f"{sim_word}:: {cbow_similarities}")

print("Skip-Gram")
sg = FastText(size=4, window=3, min_count=1,
              sg=1)      # only difference!
sg.build_vocab(sentences=common_texts)
sg.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)
sg_similarities = " | ".join(
    [f"{word}: {sim:0.4f}" for (word, sim) in sg.most_similar("computer")]
)
print(f"{sim_word}:: {sg_similarities}")

Works in 3.7.1

Python 3.7.3 | Gensim 3.7.1
CBOW
<input>:16: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
computer:: graph: 0.6275 | time: 0.4795 | interface: 0.3012 | user: 0.1459 | trees: 0.0747 | system: -0.1502 | human: -0.2375 | survey: -0.3557 | response: -0.5107 | eps: -0.5126
Skip-Gram
<input>:27: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
computer:: graph: 0.6274 | time: 0.4800 | interface: 0.3014 | user: 0.1466 | trees: 0.0741 | system: -0.1501 | human: -0.2371 | survey: -0.3564 | response: -0.5109 | eps: -0.5127

Fails in 3.7.3

Python 3.7.3 | Gensim 3.7.3
CBOW
C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.
  "C extension not loaded, training will be slow. "
<input>:16: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
computer:: human: 0.8081 | interface: 0.5414 | graph: 0.4632 | time: 0.3914 | survey: 0.0709 | eps: -0.1581 | minors: -0.1638 | trees: -0.2344 | user: -0.4144 | system: -0.4159
Skip-Gram
Exception in thread Thread-47:
Traceback (most recent call last):
  File "C:\Users\yzxs008\AppData\Local\Programs\Python\Python37\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\yzxs008\AppData\Local\Programs\Python\Python37\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\base_any2vec.py", line 211, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\fasttext.py", line 834, in _do_train_job
    tally += train_batch_sg(self, sentences, alpha, work, neu1)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\fasttext.py", line 412, in train_batch_sg
    train_sg_pair(model, model.wv.index2word[word2.index], subwords_indices, alpha, is_ft=True)
  File "C:\Users\yzxs008\Documents\ml_env\lib\site-packages\gensim\models\word2vec.py", line 418, in train_sg_pair
    l1_ngrams = np_sum(context_vectors_ngrams[context_index[1:]], axis=0)
IndexError: too many indices for array

All 7 comments

Given the extra error ("C extension not loaded, training will be slow"), it looks like (1) your gensim-3.7.3 installation didn't get the native libraries your earlier installation did; and (2) the gensim plain-Python code path is what's broken. That's rarely used, as it's up-to-100x slower, and thus far must be manually tested (since the normal, important testing successfully loads/tests the optimized variants).
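A quick way to tell which code path an installation will use is to inspect `FAST_VERSION`, which gensim's `word2vec` module exposes; in the 3.x series it is -1 when the compiled routines failed to import. The helper name `compiled_routines_status` below is made up for illustration, and the sketch assumes nothing beyond that attribute:

```python
def compiled_routines_status():
    """Report whether gensim's compiled training routines are importable.

    Returns (available, detail). The helper name is hypothetical; only
    FAST_VERSION comes from gensim itself.
    """
    try:
        # FAST_VERSION is -1 in gensim 3.x when the C extension is missing.
        from gensim.models.word2vec import FAST_VERSION
    except Exception as exc:
        # A missing or broken install can fail in several ways, so this
        # diagnostic deliberately catches broadly.
        return False, f"gensim could not be imported: {exc}"
    return FAST_VERSION >= 0, f"FAST_VERSION = {FAST_VERSION}"


available, detail = compiled_routines_status()
print("compiled routines available" if available else "slow or missing path", "|", detail)
```

Running this right after an install gives a faster signal than waiting for the "C extension not loaded" warning during training.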

So, @zstachniak, your local problem may be fixable by ensuring the native libraries are available. On Windows, often a 'wheel' install or 'conda' install will succeed in that, even when a 'pip install' does not. (You have to watch the install output closely; a failure to build native libraries will generate a message, but not cause the overall installation to fail.)

The gensim-side problem would require either (1) fixing up and testing the pure-Python paths (and perhaps arranging for them to be auto-tested, though that would be a pain and would noticeably slow automated testing); or (2) explicitly dropping support for the plain-Python paths and improving the error messages when the optimized code isn't available.
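Option (2) amounts to failing fast with an actionable message instead of silently falling back. The following is only a generic sketch of that pattern, not the change gensim actually made; `require_compiled` and its arguments are hypothetical names, and a standard-library module stands in for a real extension so the snippet runs anywhere:

```python
import importlib


def require_compiled(module_name, package_label):
    """Import a compiled module or raise an error that says how to fix it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with installation advice rather than falling back to a
        # slow, untested code path.
        raise ImportError(
            f"{package_label} needs the compiled module {module_name!r}, "
            "which could not be imported. Reinstall from a binary wheel or "
            "conda package, or install a C compiler and rebuild."
        ) from exc


# Demonstration: "math" stands in for a real extension module.
math_module = require_compiled("math", "demo")
print(math_module.sqrt(16.0))
```

With this approach the installation problem surfaces at import time, rather than as an `IndexError` deep inside a training thread.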

I'm inclined toward (2). We're really trying to tighten up our interfaces and remove brittle / academic fluff now.

The pure-Python path may have been useful for educational reasons historically, but it serves little purpose now, and it has real downsides: it is untested and it masks installation issues.

CC @mpenkov thoughts?

Ah, interesting. @gojomo, any idea why a pip install of 3.7.1 works with my C compiler but 3.7.2 and above do not? I'm not seeing any messages indicating an error during install...

Update: When trying to install directly from a PyPI download, I did finally encounter error messages during install (but still only for 3.7.3). For some reason, running pip install gensim==3.7.3 was not warning me about any problems.

Devs, let me know if I should close this issue, and thanks for your support!

For any other Python users who are forced to use a Windows box...
After spending far too long monkeying around with Visual Studio C++ compiler support, I ended up resorting to installing gensim from a Windows binary. My error message has gone away and everything is running correctly now.

I'm +1 for removing pure-Python support for fasttext. I can't see a reason for using it. @menshikh-iv WDYT?

@mpenkov I'm +1 for dropping the pure-Python implementations of w2v/d2v/ft/etc. and keeping only the Cython implementations.

OK, opened a separate ticket to deal with it. I think we can close this one.
