Spacy: How do I train sentence splitter without training DEP parser?

Created on 15 Jan 2020 · 19 comments · Source: explosion/spaCy

The UD dep corpus is rather small and is made of single sentences per paragraph, so I'd like to train a sentence splitter for the DEP parser on arbitrary sentences, say, from other data sources containing sentence breaks.
How do I do it?

Which page or section is this issue related to?

Training? Training DEP?

training

All 19 comments

For training data without real paragraphs there's a spacy convert option that lets you group sentences into fake paragraphs so that the parser has a chance to learn sentence boundaries:

spacy convert -n 10

We use this option for most of the UD corpora behind spacy's provided models. In theory the converter could also support the UD document and paragraph markers, but there are so many UD/CoNLL-U corpora that don't have them and it doesn't seem like something that spacy necessarily needs to support.
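Conceptually, the `-n` option just chunks a corpus of single-sentence examples into pseudo-paragraphs so each training document contains several sentence boundaries. A minimal sketch of that grouping in plain Python (not spaCy's actual converter code):

```python
def group_sentences(sentences, n=10):
    """Group single sentences into pseudo-paragraphs of up to n sentences,
    mimicking the effect of `spacy convert -n 10`: the model then sees
    documents with internal sentence boundaries to learn from."""
    return [sentences[i:i + n] for i in range(0, len(sentences), n)]
```

With 25 sentences and `n=10` this yields three pseudo-paragraphs of sizes 10, 10, and 5.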

The next major version of spacy (v3?) will have a new component called SentenceRecognizer that you can train just from sentence-segmented data. Underneath it's just a tagger that tags words as either sentence start or not. It's much smaller and faster than the parser, with slightly higher precision and slightly lower recall in the tasks I've tried so far. (Prodigy will also support sentence boundary annotation as one option for gathering custom training data for this task.)
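The core idea (a tagger over binary sentence-start labels) can be sketched as plain per-token labeling. A hypothetical illustration of how sentence-segmented data turns into training labels, not the actual SentenceRecognizer code:

```python
def sents_to_token_labels(sentences):
    """Flatten a list of tokenized sentences into one token sequence plus a
    parallel list of binary labels: 1 if the token starts a sentence, else 0.
    This is the kind of supervision a sentence-start tagger trains on."""
    tokens, labels = [], []
    for sent in sentences:
        for i, tok in enumerate(sent):
            tokens.append(tok)
            labels.append(1 if i == 0 else 0)
    return tokens, labels
```

Any corpus with sentence boundaries (no syntax needed) can produce these labels, which is what makes this component trainable without a dependency treebank.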

If you'd like to experiment with it, it should be working in the develop branch, and feedback is very welcome! If you have data in spacy's training format (you just need "orth" for each token at a minimum), you can train a model with spacy train -p sentrec. (Please keep in mind that develop is not going to be stable! At the very least sentrec is a terrible name that will change in the future...)

That new component system (Morphologizer, SentRec) is really looking great! Thanks, I'll try!
Backporting to 2.1/2.2 also looks trivial for SentRec -- just update the .is_sent_start from this new model.

I think that backporting is a little more challenging than it looks because of how the gold annotations are passed around as a tuple during training in v2. This has been improved a lot with a new class called Example in develop, which is why a number of new components have only been added in develop and there aren't plans to backport any of them.

Well, first of all, I want to support existing v2.1/v2.2 right now, that's why I consider using existing components and maybe several tricks, and backporting could be just one of them.
Can't we just set it as an additional model in the pipeline, taking a copy from one of POS/NER/DEP? We just need binary yes/no output per token.
And reg training backporting... you know, CLI training is broken anyway in v2.1/v2.2, so I had to write my own wrappers for almost everything.
The list of the issues I had:

  • GoldCorpus is so memory-hungry.
  • NER training peeks for possible moves, missing some of them, and crashes.
  • only single NER and one textcat can be supported with GoldParse
  • no way to choose GPU device from CLI.

Sure, you can subclass Tagger to create a new component that assigns is_sent_start instead of tag pretty easily. Training is still an issue because of the annotation tuples that are passed around, though. There are dozens of relatively hidden spots where the annotation tuples are unpacked and it's a major pain when you're adding new attributes like sent_start or morph.

But the sent-start-tagger is an especially easy case to implement. Here's my original proof-of-concept that just modified Tagger directly to test the idea:

https://github.com/adrianeboyd/spaCy/commit/45e08aa5e6f26703f6bb95abe4e0127a361aa374

(This is just barely enough to train and test a tagger and doesn't account for all the tuples that will break elsewhere in other components because of the changes.) And #4713 shows the version with Example that went into develop.

About the issues you mentioned:

  • GoldCorpus is so memory-hungry.

Spacy's not really designed with huge training corpora in mind, and I suspect the memory usage is going to be slightly worse with the Example class. Where did you run into problems?

  • NER training peeks for possible moves, missing some of them, and crashes.

That is a problem! It's definitely not great that some of the options are hard-coded in the NER model. How did you work around it in your wrapper?

  • only single NER and one textcat can be supported with GoldParse

You can only train a single model of each type at one time, but I think it should be relatively easy to rename a model afterwards if you need multiple models in a pipeline. The NER components should only make modifications to entities that were previously unset or "O", but I haven't tested this kind of setup that much.

  • no way to choose GPU device from CLI.

You can choose the GPU device with spacy train -g 0, -g 1, etc. I also haven't tested this personally because I only have one GPU, but the option is passed straight to cupy underneath.

Sure, you can subclass Tagger to create a new component that assigns is_sent_start instead of tag pretty easily. Training is still an issue because of the annotation tuples that are passed around, though. There are dozens of relatively hidden spots where the annotation tuples are unpacked and it's a major pain when you're adding new attributes like sent_start or morph.
for (ids, words, tags, heads, labels, ner, sent_start_tags), brackets in sents:

Why do you have to unpack them btw? Store them in a dict / namedtuple / class with slots.
Don't modify the old ones -- then you don't need to update the code; put only the newer ones into this new dict.
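The suggestion might look something like this (a hypothetical sketch of a slots-based annotation container; spaCy's actual replacement is the Example class mentioned earlier in the thread):

```python
class TokenAnnotations:
    """Carry gold annotations in one object with __slots__ instead of a
    positional tuple, so a new field (sent_starts, morph, ...) can be added
    here without touching every tuple-unpacking site in the codebase."""
    __slots__ = ("ids", "words", "tags", "heads", "labels", "ner",
                 "sent_starts", "brackets")

    def __init__(self, **fields):
        # unspecified fields default to None instead of breaking unpacking
        for name in self.__slots__:
            setattr(self, name, fields.get(name))
```

Code that only needs `words` and `sent_starts` then accesses them by name and ignores the rest.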

But the sent-start-tagger is an especially easy case to implement. Here's my original proof-of-concept that just modified Tagger directly to test the idea:

adrianeboyd@45e08aa

Thanks a lot, exactly my thoughts.
Can we set this up in python though? If you need fast and modular code, bulk setting .sent_start for a doc seems a better idea (like we have with .ents getter/setter).

(This is just barely enough to train and test a tagger and doesn't account for all the tuples that will break elsewhere in other components because of the changes.) And #4713 shows the version with Example that went into develop.

About the issues you mentioned:

  • GoldCorpus is so memory-hungry.

Spacy's not really designed with huge training corpora in mind, and I suspect the memory usage is going to be slightly worse with the Example class. Where did you run into problems?

I used GoldCorpora for two large NER datasets, and got:
220 MB -> 2.5 GB JSON file -> 7-10 GB in memory
2.5 GB -> 9 GB JSON file -> ~50 GB in memory, got OOM
So I created my own format that looks like this:
[image: sample of the custom format]
Now my files take 220 MB and 2.5 GB, and less than 10 GB in memory (a 5x reduction).
Loading and saving for this format, and wrappers for GoldCorpora:
https://github.com/buriy/spacy-ru/blob/v2.1/utils/corpus.py#L48
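For illustration (this is not the exact format from the linked file), the size difference between a dict-per-entity JSON layout and plain nested lists is easy to demonstrate:

```python
import json

def verbose_record(text, ents):
    # roughly the shape of a verbose JSON layout: one dict per entity,
    # with repeated key names in every record
    return {"text": text,
            "entities": [{"start": s, "end": e, "label": l} for s, e, l in ents]}

def compact_record(text, ents):
    # the same information as nested lists: positions are implicit, no keys
    return [text, [list(e) for e in ents]]

text = "Apple is in Cupertino"
ents = [(0, 5, "ORG"), (12, 21, "GPE")]
verbose_len = len(json.dumps(verbose_record(text, ents)))
compact_len = len(json.dumps(compact_record(text, ents)))
```

The repeated key strings dominate at scale, which is consistent with the multi-gigabyte blow-up reported above.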

  • NER training peeks for possible moves, missing some of them, and crashes.

That is a problem! It's definitely not great that some of the options are hard-coded in the NER model. How did you work around it in your wrapper?

I load full dataset with my wrappers, collect all labels, and add them to the models. Then train.
There's some problem with GoldCorpus labels collection in your code, but it doesn't happen on smaller datasets for some reason. I will create an issue to reproduce it, with my stack trace and training data attached.

  • only single NER and one textcat can be supported with GoldParse

You can only train a single model of each type at one time, but I think it should be relatively easy to rename a model afterwards if you need multiple models in a pipeline. The NER components should only make modifications to entities that were previously unset or "O", but I haven't tested this kind of setup that much.

Yes, I'm doing this, but there's no way to merge two NER models, and the usage becomes much more complicated (a wrapper is needed to use them; then doc._.ents1 + doc._.ents2; then token._.ent_type2 instead of token.ent_type for secondary models, etc. -- so there's no generalization for the client-side code that uses this API!). I haven't yet investigated how to fix displaCy for that; I guess a doc.ents = doc._.ents1 + doc._.ents2 hack is needed (if they're not intersecting, of course).

  • no way to choose GPU device from CLI.

You can choose the GPU device with spacy train -g 0, -g 1, etc. I also haven't tested this personally because I only have one GPU, but the option is passed straight to cupy underneath.

But in the docs you have only this:

    use_gpu=("Use GPU", "option", "g", int),

as if the only choice were GPU or CPU.

https://spacy.io/api/cli#train :

--use-gpu, -g | option | Whether to use GPU. Can be either 0, 1 or -1.
What does it mean by "0, 1 or -1"? How about a 4-GPU machine?

Ah yes, remembered another one.
https://github.com/explosion/spaCy/blob/master/spacy/cli/train.py#L237

    if base_model:
        # Start with an existing model, use default optimizer
        optimizer = create_default_optimizer(Model.ops)  ### <<<--- here
    else:
        # Start with a blank model, call begin_training
        optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)

Does it always load an existing model on the device it was trained on, whether --use-gpu is set or unset?
Or maybe it loads it onto the CPU, but only if you have no GPU...
I'm not sure what the logic is, but it ignores the --use-gpu parameter, and I've found no model.to(device) method yet (like in PyTorch).
I think a fix should be like adding util.use_gpu(...) near https://github.com/explosion/spaCy/blob/master/spacy/cli/train.py#L150

Thanks for all your feedback!

Why do you have to unpack them btw? Store them in a dict / namedtuple / class with slots.
Don't modify the old ones -- then you don't need to update the code; put only the newer ones into this new dict.

It's just an old clunky design, which is why we've replaced it with a class in develop.

But the sent-start-tagger is an especially easy case to implement. Here's my original proof-of-concept that just modified Tagger directly to test the idea:
adrianeboyd@45e08aa

Thanks a lot, exactly my thoughts.
Can we set this up in python though? If you need fast and modular code, bulk setting .sent_start for a doc seems a better idea (like we have with .ents getter/setter).

You can set Token.is_sent_start and the values are None/True/False. Having multiple components that can set sentence boundaries gets a little complicated, see #4775, which is still undecided. You also have to watch out if you've already run a parser and then start modifying sentence boundaries, but I think the main use case for SentenceRecognizer or Sentencizer is when you aren't using a parser, either because you don't have one or it's too slow.

Yesterday I trained an xx SentenceRecognizer for all the CC BY-SA UD corpora for languages that spacy supports (29 languages, 41 corpora), and the results for ru_gsd-ud-dev seem acceptable (caveat: I have fake document boundaries every 100 sentences, so this score may be up to 1% inflated):

Sent P    96.21 
Sent R    96.55 
Sent F    96.38

Sent F for sentencizer is 95 for the same data, so that's not a huge improvement, but I mainly hope it's easier to create custom models from much larger training corpora. I initially just made the model as small as I could with performance approaching the parser for English and it's about 10x faster than the parser.
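For reference, the reported Sent F is just the harmonic mean of the precision and recall above:

```python
def f_score(p, r):
    """Balanced F1: the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# reproduces the score reported for ru_gsd-ud-dev
print(round(f_score(96.21, 96.55), 2))  # → 96.38
```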

About the issues you mentioned:

  • GoldCorpus is so memory-hungry.

Spacy's not really designed with huge training corpora in mind, and I suspect the memory usage is going to be slightly worse with the Example class. Where did you run into problems?

I used GoldCorpora for two large NER datasets, and got:
220 MB -> 2.5 GB JSON file -> 7-10 GB in memory
2.5 GB -> 9 GB JSON file -> ~50 GB in memory, got OOM
So I created my own format that looks like this:
[image: sample of the custom format]
Now my files take 220 MB and 2.5 GB, and less than 10 GB in memory (a 5x reduction).
Loading and saving for this format, and wrappers for GoldCorpora:
https://github.com/buriy/spacy-ru/blob/v2.1/utils/corpus.py#L48

Ah, the problem is mainly the JSON format, not GoldCorpus exactly. We're aware, but it's a major change and the details are hard, see #2928.

  • NER training peeks for possible moves, missing some of them, and crashes.

That is a problem! It's definitely not great that some of the options are hard-coded in the NER model. How did you work around it in your wrapper?

I load full dataset with my wrappers, collect all labels, and add them to the models. Then train.
There's some problem with GoldCorpus labels collection in your code, but it doesn't happen on smaller datasets for some reason. I will create an issue to reproduce it, with my stack trace and training data attached.

Working with smaller datasets, I think I've run into more problems with the min_action_freq setting, but both min_action_freq and the hard-coded peeking into 1000 documents will cause problems for some datasets and not provide useful error messages, which is not great.

  • only single NER and one textcat can be supported with GoldParse

You can only train a single model of each type at one time, but I think it should be relatively easy to rename a model afterwards if you need multiple models in a pipeline. The NER components should only make modifications to entities that were previously unset or "O", but I haven't tested this kind of setup that much.

Yes, I'm doing this, but there's no way to merge two NER models, and the usage becomes much more complicated (a wrapper is needed to use them; then doc._.ents1 + doc._.ents2; then token._.ent_type2 instead of token.ent_type for secondary models, etc. -- so there's no generalization for the client-side code that uses this API!). I haven't yet investigated how to fix displaCy for that; I guess a doc.ents = doc._.ents1 + doc._.ents2 hack is needed (if they're not intersecting, of course).

If you have two NER components, they shouldn't clobber existing entities, so I think you can have a pipeline with ner1 and ner2 that are both ordinary EntityRecognizer models and both modify the token-level ent_type/ent_iob without issues?

Because of the restriction on overlapping entities, I could see using ._.ents1 and ._.ents2 in certain cases, but if you're merging them together in the end, maybe you can avoid the custom extensions and it would simplify your pipeline a bit?

You can choose the GPU device with spacy train -g 0, -g 1, etc. I also haven't tested this personally because I only have one GPU, but the option is passed straight to cupy underneath.

But in the docs you have only this:

    use_gpu=("Use GPU", "option", "g", int),

as if the only choice were GPU or CPU.

https://spacy.io/api/cli#train :

--use-gpu, -g | option | Whether to use GPU. Can be either 0, 1 or -1.
What does it mean by "0, 1 or -1"? How about a 4-GPU machine?

Hmm, the docs should be improved here. It's -g GPU_ID so -g -1 is no GPU, -g 0 is GPU 0, -g 1 is GPU 1. The GPU support is not particularly sophisticated and I don't think it supports more than one GPU per training process.

Ah yes, remembered another one.
https://github.com/explosion/spaCy/blob/master/spacy/cli/train.py#L237

    if base_model:
        # Start with an existing model, use default optimizer
        optimizer = create_default_optimizer(Model.ops)  ### <<<--- here
    else:
        # Start with a blank model, call begin_training
        optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)

Does it always load an existing model on the device it was trained on, whether --use-gpu is set or unset?
Or maybe it loads it onto the CPU, but only if you have no GPU...
I'm not sure what the logic is, but it ignores the --use-gpu parameter, and I've found no model.to(device) method yet (like in PyTorch).
I think a fix should be like adding util.use_gpu(...) near https://github.com/explosion/spaCy/blob/master/spacy/cli/train.py#L150

Oh, I hadn't seen this interaction with the base_model option, that's a bug. I'll look into it. I think you're right that you just need to call util.use_gpu() before the base model is loaded.

But the sent-start-tagger is an especially easy case to implement. Here's my original proof-of-concept that just modified Tagger directly to test the idea:

Can we set this up from python?
If you need fast and modular code, bulk setting .sent_start for a doc seems a better idea (like we have with .ents getter/setter).

You can set Token.is_sent_start and the values are None/True/False. Having multiple components that can set sentence boundaries gets a little complicated, see #4775, which is still undecided. You also have to watch out if you've already run a parser and then start modifying sentence boundaries, but I think the main use case for SentenceRecognizer or Sentencizer is when you aren't using a parser, either because you don't have one or it's too slow.

Sorry, I mean: how do I reuse the Tagger model for this from Python, not from Cython? It writes to .tags for some reason; that needs to be removed, or rather the storage should be configurable...

Sent F for sentencizer is 95 for the same data, so that's not a huge improvement, but I mainly hope it's easier to create custom models from much larger training corpora. I initially just made the model as small as I could with performance approaching the parser for English and it's about 10x faster than the parser.

That's not much for Russian, but we don't have more DEP data... So, how do we decouple the sentence splitter from the DEP data?
This simple rule-based approach would get 99%:
1) "[.?!]+" -> True
2) "\b\w." -> False
3) and a list of exceptions for the common abbreviations:
"\b(word). [а-яё]" -> False

( https://en.wikipedia.org/wiki/Abbreviation#Periods_(full_stops)_and_spaces )
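Those three rules could be sketched roughly like this (a toy illustration with a hypothetical abbreviation list, not a production splitter):

```python
import re

# hypothetical abbreviation list; a real one would be much longer
ABBREVIATIONS = {"mr", "dr", "e.g", "т.е", "г"}

def sentence_starts(tokens):
    """Rule-based sketch: a token starts a sentence iff the previous token
    is sentence-final punctuation ([.?!]+) that does not terminate a known
    abbreviation (the exception list in rule 3)."""
    starts = [True]  # the first token always starts a sentence
    for i in range(1, len(tokens)):
        prev = tokens[i - 1]
        prev_word = tokens[i - 2].lower() if i >= 2 else ""
        is_final_punct = re.fullmatch(r"[.?!]+", prev) is not None
        starts.append(is_final_punct and prev_word not in ABBREVIATIONS)
    return starts
```

Here the period after "Dr" is suppressed by the abbreviation list, while the one after "arrived" opens a new sentence.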
Maybe it's the "[.?!]+" -> "\n" noise that confuses the dependency parser so much?
For a comparison with Russian rule-based tables for tokens and sentences, see https://github.com/natasha/razdel#%D1%82%D0%BE%D0%BA%D0%B5%D0%BD%D1%8B
One day I'm going to update it with spacy tokenizer experiments, but right now I'm focused on adding vectors to improve POS and DEP quality to be close to Russian SOTA.

Ah, the problem is mainly the JSON format, not GoldCorpus exactly. We're aware, but it's a major change and the details are hard, see #2928.

That's a great plan! Fingers crossed!

"tokens": [
{"start": 0, "end": 5, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 1},

You might also consider adding "norm": "Apple" and "morph": {...} to the tokens list.
Also, from that issue I didn't get what the actual blocker to implementing this is, or why it's hard to implement.

If you have two NER components, they shouldn't clobber existing entities, so I think you can have a pipeline with ner1 and ner2 that are both ordinary EntityRecognizer models and both modify the token-level ent_type/ent_iob without issues?

1) you need to move .ents to some storage before applying another EntityRecognizer
2) you need to copy it back later

Well, yes, you can put them back into token.ent_type, but only with custom handling like doc.ents = doc.ents + doc._.ents2, because setting .ents will override all token types with 'O'.
And where would you put this code?
Just try to write it and you'll see how complicated it is now in the pipeline (https://github.com/explosion/spaCy/issues/3933 ).
Actually, I also don't see why entities can't intersect each other in spaCy.
I think this will work better for most uses (if not all):

class Entity:  # a Cython class, slots class, or NamedTuple; a slots class here for brevity
    # start and end are token IDs; only the token knows its char position in the doc's raw text
    __slots__ = ('start', 'end', 'ent_type')

# each token knows which entities it is part of:
token.ents = {ent_typeA: ent_spanA, ent_typeB: ent_spanB}
# mass-set them, like you do right now: doc.ents = [ent_spanA, ent_spanB]

A single EntityRecognizer model could add only the non-intersecting entities it has internally, so you could expect token.ents to contain at most one tag per entity type.
But EntityRuler would add possibly-intersecting entities as it does right now.
Also, multiple models would set possibly-intersecting entities.
So if you have EntityRecognizer markup which has intersecting entities, you just split it into several models, train all, and that's it.
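A sketch of how the proposed token-level storage could allow overlapping entities (all names here are hypothetical, not spaCy API):

```python
from collections import defaultdict

def index_entities_by_token(entities):
    """Index (start, end, label) token spans so each token can carry more
    than one entity label: overlapping spans (e.g. a LOC inside an ADDRESS)
    coexist instead of clashing, per the proposal above."""
    per_token = defaultdict(dict)
    for start, end, label in entities:
        for i in range(start, end):
            per_token[i][label] = (start, end)
    return dict(per_token)
```

Token 2 below sits inside both an ADDRESS span and the LOC span nested within it, and both labels survive.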

I think a fix should be like adding util.use_gpu(...) near https://github.com/explosion/spaCy/blob/master/spacy/cli/train.py#L150

Oh, I hadn't seen this interaction with the base_model option, that's a bug. I'll look into it. I think you're right that you just need to call util.use_gpu() before the base model is loaded.

Generally, I think the begin_training() abstraction is too heavy: this function does much more than it should and ought to be split into several functions.
It initializes the pipeline models if needed, changes the vectors, initializes an optimizer, sets up the optimizer params, changes the model's GPU assignment...

And one more question to https://github.com/explosion/spaCy/blob/master/spacy/cli/train.py#L405
Why does evaluation run on the CPU even if training is on the GPU?

with Model.use_device("cpu"):
    nlp_loaded = util.load_model_from_path(epoch_model_path)
    ...
    scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)

I think you'll like the updates being made for thinc and spacy that will make configuring your own models much easier, see a preview here: #4920.

spacy train with GPU:

The evaluation is just run again on CPU to get the timing information for the meta output. You can skip it.

SentenceRecognizer for UD Russian corpora:

It looks like UD Russian GSD has some preprocessing issues that lead to poor tokenization, so it wasn't a good example. Training a SentenceRecognizer just on UD Russian SynTagRus the f-score is 98.8.

Spacy's rule-based Sentencizer is only at 91.2 for the same data because of what looks like dialogue and list items marked with -, which the rule-based Sentencizer doesn't handle correctly.

Overlapping entities:

The entities are stored at the token level and each token only has one spot to store entity information in ent_type and ent_iob. Restricting the predictions to one tag per token makes the transition-based NER model much simpler and faster. The NER model preserves any non-O tags that it transitions over, leaving any existing tags in place. There were some bugs in v2.1 (it would sometimes get into an invalid state and stop predicting any new entities), but these were fixed for v2.2, and if you wanted to backport the changes to v2.1 I think it would be pretty easy.

If you want more control over how overlapping entities are handled you'd have to do what you're doing, where you save the current entities elsewhere, reset everything to O, predict, and then merge the entities according to your priorities.
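The manual merge step described above (save entities elsewhere, re-predict, then combine according to your priorities) might look roughly like this sketch:

```python
def merge_entity_layers(layers):
    """Merge several lists of (start, end, label) spans, with earlier layers
    taking priority: a span is kept only if it does not overlap any span
    already accepted.  A sketch of the manual merge step, not spaCy code."""
    accepted = []
    for layer in layers:
        for start, end, label in sorted(layer):
            if all(end <= s or start >= e for s, e, _ in accepted):
                accepted.append((start, end, label))
    return sorted(accepted)
```

A span from a lower-priority layer that collides with an already-accepted one is simply dropped, which keeps the final doc.ents non-overlapping.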

You are currently restricted to setting entities through doc.ents in Python, I think mainly to help keep everything in a consistent state. Well, that's not quite accurate: you can currently modify Token.ent_type but not Token.ent_iob (see #4790). doc.ents only reads the type off the B token, so if you make the I types inconsistent you don't really notice.

I think you'll like the updates being made for thinc and spacy that will make configuring your own models much easier, see a preview here: #4920.

Thanks. I think my next experiment will be a custom Tok2Vec component...
Lemma vectors + morpho-syntactic traits from the dictionary should improve POS quality from 93% to 97-98%, and it won't take 2-5 GB in memory (or on disk).
I think of something like
word -> lemmas -> average lemma vector -> compose
word -> traits list -> average traits vector -> compose
Lemma vectors alone might get 95-96% for POS, maybe.
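The lemma-averaging step of that idea can be sketched as follows (the lookup tables and names here are hypothetical placeholders, not a real dictionary):

```python
def average_vectors(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def word_vector(word, lemma_lookup, lemma_vectors):
    """Represent a word by the average of its candidate lemmas' vectors:
    word -> lemmas -> average lemma vector, as in the pipeline above.
    Falls back to treating the word itself as its own lemma."""
    lemmas = lemma_lookup.get(word, [word])
    vecs = [lemma_vectors[l] for l in lemmas if l in lemma_vectors]
    return average_vectors(vecs) if vecs else None
```

The traits branch ("word -> traits list -> average traits vector") would work the same way over a traits table, with the two averaged vectors then composed into one input.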

But I'm most worried about syntax parsing quality -- at the current 88%/84% it's almost 2x more errors than the more typical 92%/89% (the Russian SOTA is 94%/90% with the Stanford parser).

Training a SentenceRecognizer just on UD Russian SynTagRus the f-score is 98.8

Cool, that's a good number, though I think there were still not enough examples... Duplicate the dataset several times before applying the --n-sents transformation and it should improve to >99%.

spacy train with GPU:
The evaluation is just run again on CPU to get the timing information for the meta output. You can skip it.

I just want it on the GPU (for all but the last epoch, if the timing is needed -- though I don't think anyone would need it!). Running it on the CPU when the GPU is 3-5x faster means training the model takes up to 2x longer.

The entities are stored at the token level and each token only has one spot to store entity information in ent_type and ent_iob.

That's my suggestion exactly -- you can change it to store information for multiple entries there.

Restricting the predictions to one tag per token makes the transition-based NER model much simpler and faster.

But only when you can do so... Practicality beats purity (c) Zen of python
If you have multiple NER datasets instead of one -- you're out of luck!
If you have intersecting entries in multiple datasets -- we're sorry, pal, just go away!
Say, my clients ask for NER for Dates, Addresses, Locations, and Persons... how do I do it?
I have a very large addresses list, I have a very large person names list, I have persons & locations NER dataset, I have dates NER dataset (a different one!).
Locations are usually parts of the addresses... but persons and dates don't intersect with addresses and locations.

I have made an attempt to learn a single NER model from multiple datasets (for non-intersecting entries!), which would work well with a usual parser, but a transition-based parser has never seen the Date-{U,L} -> Person-{U,B} and Person-{U,L} -> Date-{U,B} transitions, so I think that's why it gets stuck sometimes -- but maybe, as you say, it's also a 2.1-specific issue; I'll try 2.2 for that too.
Next, I'm going to try a teacher-student kind of training that will transfer the trained entities into a different dataset. That should work.
However, for that I additionally need to learn addresses on text chunks rather than on full sentences... I'm clueless here right now.
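The unseen-transition problem can be made concrete: collect the adjacent BILOU tag pairs that actually occur in the training data, and any transition absent from that set (e.g. a Date-L followed by a Person-B when the two types never co-occur in one corpus) is one the model has never observed. A sketch:

```python
def seen_transitions(tag_sequences):
    """Collect the set of adjacent BILOU tag pairs observed in training
    sequences.  Pairs absent from this set are transitions a transition-based
    model never learns, per the discussion above."""
    pairs = set()
    for seq in tag_sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs
```

Comparing this set across combined datasets shows exactly which cross-type transitions are missing.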

I think you'll like the updates being made for thinc and spacy that will make configuring your own models much easier, see a preview here: #4920.

Thanks. I think my next experiment will be a custom Tok2Vec component...

This is definitely what should become easier!

spacy train with GPU:
The evaluation is just run again on CPU to get the timing information for the meta output. You can skip it.

I just want it on the GPU (for all but the last epoch, if the timing is needed -- though I don't think anyone would need it!). Running it on the CPU when the GPU is 3-5x faster means training the model takes up to 2x longer.

spacy train has gotten a bit bloated over time. We do all of the training for spacy's default models on CPU, which is why the options with GPU haven't gotten as much testing / optimization.

Restricting the predictions to one tag per token makes the transition-based NER model much simpler and faster.

But only when you can do so... Practicality beats purity (c) Zen of python
If you have multiple NER datasets instead of one -- you're out of luck!
If you have intersecting entries in multiple datasets -- we're sorry, pal, just go away!
Say, my clients ask for NER for Dates, Addresses, Locations, and Persons... how do I do it?
I have a very large addresses list, I have a very large person names list, I have persons & locations NER dataset, I have dates NER dataset (a different one!).
Locations are usually parts of the addresses... but persons and dates don't intersect with addresses and locations.

I have made an attempt to learn a single NER model from multiple datasets (for non-intersecting entries!), which would work well with a usual parser, but a transition-based parser has never seen the Date-{U,L} -> Person-{U,B} and Person-{U,L} -> Date-{U,B} transitions, so I think that's why it gets stuck sometimes -- but maybe, as you say, it's also a 2.1-specific issue; I'll try 2.2 for that too.
Next, I'm going to try a teacher-student kind of training that will transfer the trained entities into a different dataset. That should work.
However, for that I additionally need to learn addresses on text chunks rather than on full sentences... I'm clueless here right now.

Definitely try v2.2 if you're combining NER models. The bugs in v2.1 were pretty major. If you want to use spacy, you'll have to think about how to split up the entity types and combine NER models in a way that makes sense for what you need in the end, and doc.ents won't be enough to represent all your spans.

Otherwise, I think you're moving into areas that are beyond what spacy is likely to support (soon, anyway). I'd suggest looking into recent research on nested named entity recognition, e.g.: https://www.aclweb.org/anthology/D18-1124.pdf or https://www.aclweb.org/anthology/P19-1510.pdf

Tentatively closing this, as it looks like the most important issues have been discussed / resolved and there are no open action points. Feel free to re-open if there are.

FYI: @svlandeg @adrianeboyd I ended up with this approach for NER:
https://github.com/buriy/active_ner/blob/master/anno/vec.py +
https://github.com/buriy/active_ner/blob/master/anno/ner.py
and a similar approach for TextCat:
https://github.com/buriy/spacy-ru/blob/v2.2/vec/vec.py#L12

Next I will update POS, DEP, and the sentence segmenter in a similar way.
(More morphological features will be added to this custom Tok2Vec later.)

It would be great if you could make this easier to implement without replacing too much spacy code.

