spaCy: ValueError: a must be greater than 0 unless no samples are taken, while pretraining using the CLI

Created on 3 Dec 2020 · 7 comments · Source: explosion/spaCy

Hi,
I am using spaCy version 2.3.2 to pretrain the spaCy tok2vec layer.
I prepared raw text data in the format spaCy asks for.
mydata.jsonl looks like this:

```
{"text": "ホッケーにはデンジャラスプレーの反則があるので、膝より上にボールを浮かすことは基本的に反則になるが、その例外の一つがこのスクープである。"}
{"text": "また行きたい、そんな気持ちにさせてくれるお店です。"}
```

My pretraining CLI command is:

```
python -m spacy pretrain mydata.jsonl ja_core_news_lg outpath
```

After running this command I got the error below. I changed the Japanese model version and still hit the same problem. Training with English data works fine; the problem only occurs with Japanese text.

```
ℹ Using GPU
⚠ Output directory is not empty
It is better to use an empty directory or refer to a new output path, then the
new directory will be created for you.
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model 'ja_core_news_lg'

============== Pre-training tok2vec layer - starting at epoch 0 ==============
# Words   Total Loss   Loss   w/s

Traceback (most recent call last):
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/cli/pretrain.py", line 237, in pretrain
    model, docs, optimizer, objective=loss_func, drop=dropout
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/cli/pretrain.py", line 264, in make_update
    predictions, backprop = model.begin_update(docs, drop=drop)
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 837, in mlm_forward
    mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob)
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 884, in _apply_mask
    word = _replace_word(token.text, random_words)
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 904, in _replace_word
    return random_words.next()
  File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 865, in next
    numpy.random.choice(len(self.words), 10000, p=self.probs)
  File "mtrand.pyx", line 902, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
```
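The last frame shows `numpy.random.choice` being called with a population size of 0, which is what the ValueError means. It reproduces in isolation, with nothing spaCy-specific involved:

```python
import numpy

# choice() with a == 0: there is no population to sample from.
numpy.random.choice(0, 10000)
# ValueError: a must be greater than 0 unless no samples are taken
```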

Labels: feat / tok2vec, lang / ja

All 7 comments

I found where the problem occurs, but I still don't understand the cause.
The problem is here:

```python
class _RandomWords(object):
    def __init__(self, vocab):
        # If iterating over the vocab yields no lexemes with probabilities,
        # self.words ends up empty.
        self.words = [lex.text for lex in vocab if lex.prob != 0.0]
        self.probs = [lex.prob for lex in vocab if lex.prob != 0.0]
        self.words = self.words[:10000]
        self.probs = self.probs[:10000]
        self.probs = numpy.exp(numpy.array(self.probs, dtype="f"))
        self.probs /= self.probs.sum()
        self._cache = []

    def next(self):
        if not self._cache:
            # With an empty word list, len(self.words) == 0 and this call
            # raises the ValueError from the traceback above.
            self._cache.extend(
                numpy.random.choice(len(self.words), 10000, p=self.probs)
            )
        index = self._cache.pop()
        return self.words[index]
```

I checked the English model's vocabulary with:

```python
import spacy

nlp = spacy.load('en_core_web_lg')
vocab = nlp.vocab
for i in vocab:
    print(i.text)
```

This prints the vocab tokens, but the same loop prints nothing for the Japanese model ja_core_news_lg.
Can anyone explain why?
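For reference, a compact way to see the difference between the two vocabs (a minimal sketch; it assumes both en_core_web_lg and ja_core_news_lg are installed, under spaCy v2.x):

```python
import spacy

# Count how many lexemes each vocab yields when iterated over -- this is
# exactly the iteration that _RandomWords depends on.
for model_name in ("en_core_web_lg", "ja_core_news_lg"):
    nlp = spacy.load(model_name)
    n_lexemes = sum(1 for lex in nlp.vocab)
    print(model_name, "->", n_lexemes, "lexemes")  # reportedly 0 for ja
```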

Hi,

What exactly is it that you're trying to do? You mention this command:

```
python -m spacy pretrain mydata.jsonl ja_core_news_lg outpath
```

But pretraining is really not useful in combination with a pretrained model like ja_core_news_lg.

Basically, what pretraining does is create reasonable initial weights for the tok2vec layer in a spaCy pipeline. You can then use this pretrained tok2vec layer in a subsequent train step to train additional spaCy components (like NER, textcat, ...) on top of it.

However, ja_core_news_lg already contains a trained tok2vec layer AND trained parser & NER components. Do you want to use any of that? Or do you just want token vectors trained on your specific text?

Hi,
Thanks for the quick response.
Basically, I have a small amount of NER data.
I want to pretrain a tok2vec model on my large raw corpus and feed those weights into NER training.
According to the CLI pretraining documentation, pretrained weights should help when only a small amount of NER data is available.

> However, ja_core_news_lg already contains a trained tok2vec layer AND trained parser & NER components. Do you want to use any of that? Or do you just want token vectors trained on your specific text?

Actually, I want to train tok2vec on my specific raw text and use it in my NER model.

regards

> Basically, I have a small amount of NER data.
> I want to pretrain a tok2vec model on my large raw corpus and feed those weights into NER training.

Yes, that makes sense! Actually, would you feel like trying out the release candidate of spaCy 3? In version 3, we've made the pretraining functionality much more user-friendly, as it was all a bit experimental still in version 2.

The docs for v3 are here: https://nightly.spacy.io/ - you would be specifically interested in https://nightly.spacy.io/usage/embeddings-transformers#pretraining

In v3, you create a config file that will power both your training and your pretraining steps. You can create it with commands like these:

```
python -m spacy init config -l ja -p "ner" --cpu ja_ner_base.cfg
python -m spacy init fill-config ja_ner_base.cfg ja_ner_final.cfg --pretraining
```

And then you'd run pretraining:

```
python -m spacy pretrain ja_ner_final.cfg ./pretrain_output --paths.raw_text mydata.jsonl
```

and training:

```
python -m spacy train ja_ner_final.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy --paths.init_tok2vec=pretrain_output/model2.bin
```

Hi @svlandeg
Thanks for the suggestion.
I have checked the spaCy v3 nightly; it's really amazing.
But at present I am working with spaCy 2, so any possible way to pretrain tok2vec for Japanese in v2 would help me a lot.
I am keeping this issue open for new suggestions.
regards

Update About This Issue

I somewhat solved this problem by editing the pretrain.py script.
Inside pretrain.py, iterating over nlp.vocab for any Japanese model yields nothing, while the English models work fine.
So I call nlp.vocab.prune_vectors(10) after loading the model.
But another problem arises: when pruning on GPU, it raises:

```
TypeError: list indices must be integers or slices, not cupy.core.core.ndarray
```

So I disabled the GPU and trained on CPU, and it works fine.
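(A minimal sketch of the workaround described above, assuming spaCy v2.x on CPU; the pruning size of 10 is just the value from the comment and would likely be much larger in practice.)

```python
import spacy

# Load the Japanese model and prune its vectors before pretraining, so
# that the vocab exposes lexemes for _RandomWords to sample from.
# Run this on CPU: on GPU, prune_vectors reportedly fails with
# "TypeError: list indices must be integers or slices, not cupy...ndarray".
nlp = spacy.load("ja_core_news_lg")
nlp.vocab.prune_vectors(10)

# Sanity check: iterating the vocab should now yield lexemes.
print(sum(1 for lex in nlp.vocab))
```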

Thanks for reporting back @sagor71. As this feature was experimental in v2, and should work much better in v3, it's not really a priority for us to spend much more time on this for v2. I'm happy to hear you found a working solution though. I'll close this in the meantime, but let us know if you still run into issues!
