Spacy: Errors when updating already trained TextCategorizer

Created on 19 Dec 2020 · 8Comments · Source: explosion/spaCy

How to reproduce the behaviour

pip install spacy[cuda110], scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz

This week I successfully trained a TextCategorizer on gpu.

The following is the gist of the training code:

spacy.prefer_gpu()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    # if base
    if continue_training:
        # Start with an existing model, use default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()

    ...

    for i in range(epochs):
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

    # store final model
    with nlp.use_params(optimizer.averages):
        nlp.to_disk(output_dir)

The model was saved to directory spacy_ps_20201216.
Inside that directory are the following artifacts:
meta.json, textcat, tokenizer, vocab

Now, I want to load that same model and continue training.
From my understanding, I think the code, nlp.resume_training(), would do the trick; but I'm having issues either loading or updating the textcat to continue such training.

Here are the multiple ways I've attempted loading the textcat model, and their respective errors:

base_model = 'spacy_ps_20201216'
nlp = spacy.load(base_model)

Note: This works until the script reaches the update step

Traceback (most recent call last):
  File "spacy_textcat.py", line 310, in <module>
    plac.call(main)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "spacy_textcat.py", line 295, in main
    train_textcat(nlp,
  File "spacy_textcat.py", line 186, in train_textcat
    nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
    res, bp_res = func.begin_update((X, lengths))
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
    output = ops.mean_pool(X, lengths)
  File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'

I then tried to load the initial model, __then__ add the loaded textcat from the base_model.

nlp = spacy.load('en_core_sci_lg')
nlp.add_pipe(spacy.load(base_model).get_pipe('textcat'))

Note: This works on loading, but when update line is reached, I receive a familiar error.

Traceback (most recent call last):
  File "spacy_textcat.py", line 307, in <module>
    plac.call(main)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "spacy_textcat.py", line 292, in main
    train_textcat(nlp,
  File "spacy_textcat.py", line 183, in train_textcat
    nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
    res, bp_res = func.begin_update((X, lengths))
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
    output = ops.mean_pool(X, lengths)
  File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'

I then hopped out of the script to try some other ways:

>>> base_model = 'spacy_ps_20201216/textcat'
>>> nlp = spacy.load('en_core_sci_lg')
>>> textcat = spacy.pipeline.TextCategorizer(nlp.vocab)
>>> textcat.from_disk(base_model)
ValueError: Can't read file: spacy_ps_20201216/textcat/vocab/strings.json

I then moved spacy_ps_20201216/vocab into spacy_ps_20201216/textcat and reran successfully.
I updated my script code to have that change and received the following error when updating:

Traceback (most recent call last):
  File "spacy_textcat.py", line 312, in <module>
    plac.call(main)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "spacy_textcat.py", line 297, in main
    train_textcat(nlp,
  File "spacy_textcat.py", line 188, in train_textcat
    nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 1016, in spacy.pipeline.pipes.TextCategorizer.update
  File "pipes.pyx", line 88, in spacy.pipeline.pipes.Pipe.require_model
ValueError: [E109] Model for component 'textcat' not initialized. Did you forget to load a model, or forget to call begin_training()?

Back to the terminal with base_model='spacy_ps_20201216':

>>> from spacy.lang.en import English
>>> nlp = English().from_disk(base_model,exclude=["parser","tagger", "ner"])
>>> nlp.pipe_names
[]

>>> nlp = spacy.load('en_core_sci_lg')
>>> nlp.from_disk(base_model,exclude=["parser","tagger", "ner"])
<spacy.lang.en.English object at 0x7f9a02cd8e50>
>>> nlp.pipe_names
['tagger', 'parser', 'ner']

>>> nlp = nlp.from_disk(base_model,exclude=["parser","tagger", "ner"])
>>> nlp.pipe_names
['tagger', 'parser', 'ner']

I would appreciate any pointers on the right way to load a trained TextCategorizer and continue training.

Your Environment

spaCy version: 2.3.5
Platform: Linux-4.19.0-12-cloud-amd64-x86_64-with-glibc2.10
Python version: 3.8.6
Operating System (GCP Image): pytorch-1-6-cu110-notebooks-v20201105-debian-10

feat / serialize feat / textcat resolved

Source

jgieringer

Most helpful comment

Totally understood, my apologies!

I've trimmed the script to the code below.
Can confirm that if I remove spacy.prefer_gpu(), the code runs.

import scispacy
import spacy
import random
from pathlib import Path
from spacy.util import minibatch, compounding, decaying


## setup nlp & data

nlp = spacy.load('./tmp/spacy_model') # en_core_sci_lg or ./tmp/spacy_model
continue_training = True if 'textcat' in nlp.pipe_names else False

train_data = [
    ('Innovation in Database Management: Computer Science vs. Engineering',
     {'cats': {'SIGGRAPH': False,
               'VLDB': True,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('High performance prime field multiplication for GPU',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': True,
               'INFOCOM': False}}),
    ('enchanted scissors: a scissor interface for support in cutting and interactive fabrication',
     {'cats': {'SIGGRAPH': True,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('Detection of channel degradation attack by Intermediary Node in Linear Networks',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': True}})
]

train_data = [(nlp.tokenizer(txt), cats) for txt,cats in train_data]
model_labels = set(train_data[0][1]['cats'].keys())


## textcat setup

if not continue_training:
    textcat = nlp.create_pipe(
        'textcat',
        config={
            'exclusive_classes': True,
            'architecture': 'simple_cnn',
        }
    )
    nlp.add_pipe(textcat, last=True)

# store model labels in textcategorizer if they don't already exist
current_labels = set(nlp.get_pipe('textcat').labels)
for l in model_labels:
    if l not in current_labels:
        nlp.get_pipe('textcat').add_label(str(l))


## train

epochs = 1
textcat = nlp.get_pipe("textcat")
spacy.prefer_gpu()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    # if base
    if continue_training:
        # Start with an existing model, use default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()

    # create batch sizes
    min_batch_size, max_batch_size, update_by = (1.,64.,1.001)
    batch_sizes = compounding(min_batch_size, max_batch_size, update_by)

    # create decaying dropout
    starting_dropout, ending_dropout, decay_rate = (0.6, 0.2, 1e-4)
    dropouts = decaying(starting_dropout, ending_dropout, decay_rate)

    for i in range(epochs):
        losses = {}

        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

        # export model
        outdir = Path('./tmp/spacy_model')
        if not outdir.exists():
            outdir.mkdir(parents=True)
        with nlp.use_params(optimizer.averages):
            nlp.to_disk(outdir)

jgieringer on 22 Dec 2020

👍2 👀1

All 8 comments

Thanks for the detailed report. I understand what you're trying to do, though I'm not sure yet why you're running into errors. Could you provide a minimal, standalone piece of code that exhibits the error? It would be helpful for us to try and reproduce the problem and then debug where the problem may be. The devil can be in the details ;-)

You don't have to add your full dataset or anything, you could just have one example data point on which you train, then store to disk, read back in and try training again on the same sample. The IO should work the same.

svlandeg on 19 Dec 2020

Thanks for taking a look!

I found some multi class data online which I normalized and formatted (fast text style) to the likes of my current task.

Code for the experiment is located here

Data is attached:
title_conference_train.txt
title_conference_valid.txt

Please let me know if I can provide anything else.

jgieringer on 22 Dec 2020

Hi! This is a lot of code, and I don't even know which arguments you used to invoke the main function. Could you try and trim this down into a shorter, minimal reproducible code snippet that exhibits your original error? That would help to focus on the main problem. Like I said - it's probably easier to try and reproduce your error with sample data (in code) rather than having a bunch of files, preprocessing etc.

svlandeg on 22 Dec 2020

Totally understood, my apologies!

I've trimmed the script to the code below.
Can confirm that if I remove spacy.prefer_gpu(), the code runs.

import scispacy
import spacy
import random
from pathlib import Path
from spacy.util import minibatch, compounding, decaying


## setup nlp & data

nlp = spacy.load('./tmp/spacy_model') # en_core_sci_lg or ./tmp/spacy_model
continue_training = True if 'textcat' in nlp.pipe_names else False

train_data = [
    ('Innovation in Database Management: Computer Science vs. Engineering',
     {'cats': {'SIGGRAPH': False,
               'VLDB': True,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('High performance prime field multiplication for GPU',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': True,
               'INFOCOM': False}}),
    ('enchanted scissors: a scissor interface for support in cutting and interactive fabrication',
     {'cats': {'SIGGRAPH': True,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('Detection of channel degradation attack by Intermediary Node in Linear Networks',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': True}})
]

train_data = [(nlp.tokenizer(txt), cats) for txt,cats in train_data]
model_labels = set(train_data[0][1]['cats'].keys())


## textcat setup

if not continue_training:
    textcat = nlp.create_pipe(
        'textcat',
        config={
            'exclusive_classes': True,
            'architecture': 'simple_cnn',
        }
    )
    nlp.add_pipe(textcat, last=True)

# store model labels in textcategorizer if they don't already exist
current_labels = set(nlp.get_pipe('textcat').labels)
for l in model_labels:
    if l not in current_labels:
        nlp.get_pipe('textcat').add_label(str(l))


## train

epochs = 1
textcat = nlp.get_pipe("textcat")
spacy.prefer_gpu()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    # if base
    if continue_training:
        # Start with an existing model, use default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()

    # create batch sizes
    min_batch_size, max_batch_size, update_by = (1.,64.,1.001)
    batch_sizes = compounding(min_batch_size, max_batch_size, update_by)

    # create decaying dropout
    starting_dropout, ending_dropout, decay_rate = (0.6, 0.2, 1e-4)
    dropouts = decaying(starting_dropout, ending_dropout, decay_rate)

    for i in range(epochs):
        losses = {}

        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

        # export model
        outdir = Path('./tmp/spacy_model')
        if not outdir.exists():
            outdir.mkdir(parents=True)
        with nlp.use_params(optimizer.averages):
            nlp.to_disk(outdir)

jgieringer on 22 Dec 2020

👍2 👀1

Cool, I was able to reproduce your problem with that last script. And the good news is the fix should be pretty easy: just put spacy.prefer_gpu() all the way at the top of your script, before you load the nlp model from file :-)

svlandeg on 27 Dec 2020

🚀1

This issue has been automatically closed because it was answered and there was no follow-up discussion.