pip install spacy[cuda110] scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz
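(As an aside, not part of the original report: a quick way to sanity-check that the CUDA extras were picked up is that spacy.prefer_gpu() returns a boolean.)

import spacy

# prefer_gpu() returns True if a GPU was activated, False if spaCy
# fell back to CPU -- a quick sanity check after installing the
# cuda110 extras above
print(spacy.prefer_gpu())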
This week I successfully trained a TextCategorizer on GPU.
The following is the gist of the training code:
spacy.prefer_gpu()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    if continue_training:
        # start with the existing model and use the default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()
    ...
    for i in range(epochs):
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

# store final model
with nlp.use_params(optimizer.averages):
    nlp.to_disk(output_dir)
The model was saved to the directory spacy_ps_20201216.
Inside that directory are the following artifacts:
meta.json, textcat, tokenizer, vocab
Now, I want to load that same model and continue training.
From my understanding, nlp.resume_training() should do the trick, but I'm having issues either loading the textcat or updating it to continue training.
Here are the multiple ways I've attempted loading the textcat model, and their respective errors:
base_model = 'spacy_ps_20201216'
nlp = spacy.load(base_model)
Note: This works until the script reaches the update step
Traceback (most recent call last):
File "spacy_textcat.py", line 310, in <module>
plac.call(main)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "spacy_textcat.py", line 295, in main
train_textcat(nlp,
File "spacy_textcat.py", line 186, in train_textcat
nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
res, bp_res = func.begin_update((X, lengths))
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
output = ops.mean_pool(X, lengths)
File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'
I then tried to load the initial model first, and only then add the textcat loaded from the base_model.
nlp = spacy.load('en_core_sci_lg')
nlp.add_pipe(spacy.load(base_model).get_pipe('textcat'))
Note: This works on loading, but when the update line is reached, I receive the familiar error.
Traceback (most recent call last):
File "spacy_textcat.py", line 307, in <module>
plac.call(main)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "spacy_textcat.py", line 292, in main
train_textcat(nlp,
File "spacy_textcat.py", line 183, in train_textcat
nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
res, bp_res = func.begin_update((X, lengths))
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
output = ops.mean_pool(X, lengths)
File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'
I then hopped out of the script to try some other ways:
>>> base_model = 'spacy_ps_20201216/textcat'
>>> nlp = spacy.load('en_core_sci_lg')
>>> textcat = spacy.pipeline.TextCategorizer(nlp.vocab)
>>> textcat.from_disk(base_model)
ValueError: Can't read file: spacy_ps_20201216/textcat/vocab/strings.json
I then moved spacy_ps_20201216/vocab into spacy_ps_20201216/textcat and reran successfully.
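(For the record, the move was just the following one-liner, mirroring the paths above:)

import shutil

# move the top-level vocab directory into the textcat directory,
# so textcat/vocab/strings.json exists where from_disk looks for it
shutil.move('spacy_ps_20201216/vocab', 'spacy_ps_20201216/textcat/vocab')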
I updated my script to include that change and received the following error when updating:
Traceback (most recent call last):
File "spacy_textcat.py", line 312, in <module>
plac.call(main)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "spacy_textcat.py", line 297, in main
train_textcat(nlp,
File "spacy_textcat.py", line 188, in train_textcat
nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 1016, in spacy.pipeline.pipes.TextCategorizer.update
File "pipes.pyx", line 88, in spacy.pipeline.pipes.Pipe.require_model
ValueError: [E109] Model for component 'textcat' not initialized. Did you forget to load a model, or forget to call begin_training()?
Back to the terminal with base_model='spacy_ps_20201216':
>>> from spacy.lang.en import English
>>> nlp = English().from_disk(base_model, exclude=["parser", "tagger", "ner"])
>>> nlp.pipe_names
[]
>>> nlp = spacy.load('en_core_sci_lg')
>>> nlp.from_disk(base_model, exclude=["parser", "tagger", "ner"])
<spacy.lang.en.English object at 0x7f9a02cd8e50>
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
>>> nlp = nlp.from_disk(base_model, exclude=["parser", "tagger", "ner"])
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
I would appreciate any pointers on the right way to load a trained TextCategorizer and continue training.
Thanks for the detailed report. I understand what you're trying to do, though I'm not sure yet why you're running into errors. Could you provide a minimal, standalone piece of code that exhibits the error? It would be helpful for us to try and reproduce the problem and then debug where the problem may be. The devil can be in the details ;-)
You don't have to add your full dataset or anything, you could just have one example data point on which you train, then store to disk, read back in and try training again on the same sample. The IO should work the same.
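Something along these lines would do, as a rough sketch (the label names, sample text, and output path here are just placeholders):

import spacy

# minimal round trip: train on one example, save to disk, load back, train again
nlp = spacy.load('en_core_sci_lg')
textcat = nlp.create_pipe('textcat', config={'exclusive_classes': True})
textcat.add_label('POSITIVE')
textcat.add_label('NEGATIVE')
nlp.add_pipe(textcat, last=True)

example = ('sample text', {'cats': {'POSITIVE': True, 'NEGATIVE': False}})

other_pipes = [p for p in nlp.pipe_names if p != 'textcat']
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    nlp.update([example[0]], [example[1]], sgd=optimizer)
nlp.to_disk('./tmp/minimal_model')

# read back in and try training again on the same sample
nlp2 = spacy.load('./tmp/minimal_model')
with nlp2.disable_pipes(*[p for p in nlp2.pipe_names if p != 'textcat']):
    optimizer2 = nlp2.resume_training()
    nlp2.update([example[0]], [example[1]], sgd=optimizer2)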
Thanks for taking a look!
I found some multi-class data online, which I normalized and formatted (fastText style) to match my current task.
Code for the experiment is located here
Data is attached:
title_conference_train.txt
title_conference_valid.txt
Please let me know if I can provide anything else.
Hi! This is a lot of code, and I don't even know which arguments you used to invoke the main function. Could you try and trim this down into a shorter, minimal reproducible code snippet that exhibits your original error? That would help to focus on the main problem. Like I said - it's probably easier to try and reproduce your error with sample data (in code) rather than having a bunch of files, preprocessing etc.
Totally understood, my apologies!
I've trimmed the script to the code below.
I can confirm that if I remove spacy.prefer_gpu(), the code runs.
import scispacy
import spacy
import random
from pathlib import Path
from spacy.util import minibatch, compounding, decaying

## setup nlp & data
nlp = spacy.load('./tmp/spacy_model')  # en_core_sci_lg or ./tmp/spacy_model
continue_training = True if 'textcat' in nlp.pipe_names else False

train_data = [
    ('Innovation in Database Management: Computer Science vs. Engineering',
     {'cats': {'SIGGRAPH': False,
               'VLDB': True,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('High performance prime field multiplication for GPU',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': True,
               'INFOCOM': False}}),
    ('enchanted scissors: a scissor interface for support in cutting and interactive fabrication',
     {'cats': {'SIGGRAPH': True,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('Detection of channel degradation attack by Intermediary Node in Linear Networks',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': True}})
]
train_data = [(nlp.tokenizer(txt), cats) for txt, cats in train_data]
model_labels = set(train_data[0][1]['cats'].keys())

## textcat setup
if not continue_training:
    textcat = nlp.create_pipe(
        'textcat',
        config={
            'exclusive_classes': True,
            'architecture': 'simple_cnn',
        }
    )
    nlp.add_pipe(textcat, last=True)

# store model labels in the textcategorizer if they don't already exist
current_labels = set(nlp.get_pipe('textcat').labels)
for l in model_labels:
    if l not in current_labels:
        nlp.get_pipe('textcat').add_label(str(l))

## train
epochs = 1
textcat = nlp.get_pipe("textcat")
spacy.prefer_gpu()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    if continue_training:
        # start with the existing model and use the default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()

    # create batch sizes
    min_batch_size, max_batch_size, update_by = (1., 64., 1.001)
    batch_sizes = compounding(min_batch_size, max_batch_size, update_by)

    # create decaying dropout
    starting_dropout, ending_dropout, decay_rate = (0.6, 0.2, 1e-4)
    dropouts = decaying(starting_dropout, ending_dropout, decay_rate)

    for i in range(epochs):
        losses = {}
        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

# export model
outdir = Path('./tmp/spacy_model')
if not outdir.exists():
    outdir.mkdir(parents=True)
with nlp.use_params(optimizer.averages):
    nlp.to_disk(outdir)
Cool, I was able to reproduce your problem with that last script. And the good news is the fix should be pretty easy: just put spacy.prefer_gpu() all the way at the top of your script, before you load the nlp model from file :-)
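In other words, a minimal sketch of the corrected ordering (reusing the path from the trimmed script above):

import spacy

# activate the GPU *before* loading the pipeline, so the model's weights
# are allocated with CupyOps (GPU) rather than NumpyOps (CPU)
spacy.prefer_gpu()

nlp = spacy.load('./tmp/spacy_model')
optimizer = nlp.resume_training()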
@svlandeg that did the trick! Thanks for your time and help.
Sure, happy to hear it works now!