Spacy: Adding exceptions to sentencizer

Created on 21 Aug 2019 · 14 comments · Source: explosion/spaCy

I am not sure whether I just haven't looked thoroughly enough through the docs, but I want to add abbreviation exceptions to the sentence tokenizer.

E.g. "Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%" is split into "Operating income incl." and "JV was SEK 2.1 b. with an operating margin of 4.0%".

My experience so far with spaCy tells me that there is probably a smart way to fix it?

Posted as a bug, but it might be doc-related or a feature request.

from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp('Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%')

# This assertion fails: the default sentencizer splits the text after 'incl.'
assert len([s for s in doc.sents]) == 1

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Linux-5.0.0-25-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.3
Labels: enhancement · feat / sentencizer · perf / accuracy

All 14 comments

Yes, it would be a good idea to improve the sentencizer. Right now it's extremely simple, and it would be nice if it were at least comparable to the Punkt sentence tokenizer, which you've probably used in NLTK.

For now, it's probably best to use the parser if it's not too slow for your task. You can also set the parser up as the "sentencizer" component if you need to (instructions from Matt, e.g. for working with the spacy-pytorch-transformers example scripts):

import spacy

nlp = spacy.load("en_pytt_bertbaseuncased_lg")
nlp.remove_pipe("sentencizer")
# Load a second pipeline that shares the vocab and reuse its parser
# as the sentence boundary component:
nlp2 = spacy.load("en", vocab=nlp.vocab)
nlp.add_pipe(nlp2.get_pipe("parser"), name="sentencizer", first=True)

The sentencizer is currently very simple, yeah. It shouldn't be very difficult to implement your own component if you want to have custom rules. I'm not sure whether we want to build the current one out to support different rule-sets. One option would be to have a component based on the Matcher logic, so you could write matcher rules. I'm really not sure that would be superior to alternatives though, e.g. regex might actually be better.
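For illustration, here is a rough sketch of such a custom component in the spaCy 2.x style (add_pipe accepts a plain function): it sets sentence starts itself and skips splitting after a small abbreviation list. The component name, the ABBREVIATIONS set, and the rule details are made up for this example, not an official spaCy feature.

from spacy.lang.en import English

# Lowercase abbreviation forms with the final period stripped (illustrative only).
ABBREVIATIONS = {"incl", "approx", "etc", "b"}

def custom_sentencizer(doc):
    for token in doc[:-1]:
        nxt = doc[token.i + 1]
        if token.text in (".", "!", "?"):
            # Don't start a new sentence if the period follows a known abbreviation.
            prev = doc[token.i - 1] if token.i > 0 else None
            is_abbrev = prev is not None and prev.lower_ in ABBREVIATIONS
            nxt.is_sent_start = not is_abbrev
        else:
            nxt.is_sent_start = False
    return doc

nlp = English()
nlp.add_pipe(custom_sentencizer, first=True)  # spaCy 2.x: add_pipe accepts a callable
doc = nlp("Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%")
print([sent.text for sent in doc.sents])

Because English() has no parser here, the component is free to set is_sent_start directly, and doc.sents then follows those boundaries.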

I do think having an option to use Punkt would be good though. I think in many situations it's better than what we provide, if you need sentence boundary detection and prefer speed to accuracy. So we can at least provide that option, since the algorithm is pretty simple.

I looked at the NLTK code and it is a good bit more complicated than I thought overall, or at least the code has gotten more complicated over time. Would you want to have a wrapper for NLTK punkt or reimplement something like it?

(Not that it's worse than sentencizer, but I really don't like the fact that the sentence boundary tokens are hard-coded in NLTK punkt.)

I think we would want to reimplement, maybe based on their implementation --- but obviously we wouldn't want to depend on NLTK, and we have different code conventions from them, e.g. we wouldn't make the trainer a different class.

I thought about this some more and punkt isn't the right kind of algorithm for tokenized text, anyway.

Would you consider a teeny tagger model for the sentencizer? I tried it out with two tags (sent_start vs. not) on OntoNotes and it seemed to work relatively well. (I haven't written the full evaluation code yet, but evaluating it as a tagger and inspecting the results, it looks acceptable.)

I'm not sure what your speed/size requirements would be? The smallest model that worked okay was 64K. I updated spacy-benchmarks a bit (extremely hackily) and for tokenizer+sbd it's about twice as fast as NLTK. It's ~4x faster than 'en' parsing.

Again, I'd need to write a better evaluation, but only if you think it's worth pursuing.

If the question was directed at me, then speed and ease of use are the important factors. Not sure if the question was for @honnibal, though.

Sorry, that was confusing, the question was for @honnibal. I think speed is going to be the main factor here.

@adrianeboyd I think that'd be a fine solution. We might even be able to tune the width of the CNN downwards, so that the model is even faster. Probably we would make this a subclass of Tagger, so that it can do the set_annotations properly, and also so it can use the sentence boundaries in the gold standard?

I tried to implement it with Sentencizer as a subclass of Tagger, but couldn't figure out some of the details related to serialization. So, the mock-up I have has it replace Tagger in the pipeline. Training from the sentences in the gold standard is not difficult. Here's the basic sketch:

https://github.com/adrianeboyd/spaCy/tree/feature/sentence-tagger

I tried to reduce everything as much as I could:

self.cfg.setdefault("cnn_maxout_pieces", 1)      # single maxout piece in the CNN
self.cfg.setdefault("subword_features", False)   # no prefix/suffix/shape features
self.cfg.setdefault("token_vector_width", 6)     # very narrow token vectors
self.cfg.setdefault("conv_depth", 1)             # a single convolutional layer
self.cfg.setdefault("hidden_width", 4)           # tiny hidden layer
self.cfg.setdefault("pretrained_vectors", None)  # no pretrained vectors

Again, I haven't done a good evaluation yet.

Is there anything I can do to help on this?

It's actually basically done on develop:

https://github.com/explosion/spaCy/blob/8137b24928432c7c23ea66d190584336075e29ae/spacy/pipeline/pipes.pyx#L758-L921

develop is currently under pretty heavy development (mainly due to the rewrite of thinc), but you're welcome to try it out. You should be able to train models with spacy train -p sentrec and the JSON training format, as long as you have orth for each of the tokens in each sentence. The shortcut name is probably also going to change from sentrec to senter, unless someone comes up with a better name in the meantime.

Wait, hmm, it looks like it hasn't been updated for some of the very recent changes, but if you try it as of about https://github.com/explosion/spaCy/commit/d2f3a44b42bfff9773fdf3abaccdcc0e78d295f7, it should work.
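For reference, a minimal sketch of what that JSON training input could look like, assuming the v2-style documents > paragraphs > sentences > tokens layout where only orth is required per token; the file name and the sentences themselves are invented for illustration.

import json

train_data = [
    {
        "id": 0,
        "paragraphs": [
            {
                "sentences": [
                    # One entry per sentence; each token only carries "orth".
                    {"tokens": [{"orth": "Operating"}, {"orth": "income"},
                                {"orth": "incl."}, {"orth": "JV"},
                                {"orth": "was"}, {"orth": "strong"},
                                {"orth": "."}]},
                    {"tokens": [{"orth": "Margins"}, {"orth": "improved"},
                                {"orth": "."}]},
                ]
            }
        ],
    }
]

with open("train.json", "w", encoding="utf8") as f:
    json.dump(train_data, f, indent=2)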

I've also worked on Prodigy recipes for it, which isn't too much work because it's just a variant of pos, but those will have to wait until Prodigy is updated for spaCy v3.

Sounds great. Is it planned to be included in the next release?

This will be in v3.

Awesome work, Adriane. I'll go ahead and close this issue as it's basically done (just waiting on the release ;-)).
