Can you provide an example of how to use spaCy for sentence segmentation? Thanks so much!
You could do something like this.
In [34]: text = u'''Python packaging is awkward at the best of times, and it's particularly tricky with C extensions, built via Cython, requiring large data files. So, please report issues as you encounter them, and bear with me :)'''
In [35]: tokens = nlp(text, parse=True)
In [36]: for s in tokens.sents:
   ....:     print ''.join(tokens[x].string for x in range(*s))
   ....:
Python packaging is awkward at the best of times, and it's particularly tricky with C extensions, built via Cython, requiring large data files.
So, please report issues as you encounter them, and bear with me :)
I think it might be a good idea to make sentence segmentation more visible, as, at first glance, people seem to assume that it might not be easy to do, or even possible.
e.g. http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html
@honnibal what would be the spaCy way of implementing a sentence splitter?
The example that Jim posted above is the current solution --- the tokens.sents attribute gives an iterator of (start, end) pairs. Unfortunately the Tokens.__getitem__ method doesn't accept a slice at the moment, so use range(start, end). This will be fixed in the next version.
The segmentation works a little differently from other libraries'. It uses the syntactic structure, not just the surface clues from the punctuation.
spaCy is unique among current parsers in parsing whole documents, instead of splitting first into sentences, and then parsing the resulting strings. This is possible because the algorithm is linear time, whereas a lot of previous parsers use polynomial time parsing algorithms.
This means the sentence boundary detection is often robust to difficult cases:
>>> tokens = nlp(u"If Harvard doesn't come through, I'll take the test to get into Yale. many parents set goals for their children, or maybe they don't set a goal.")
>>> for start, end in tokens.sents:
...     print ''.join(tokens[i].string for i in range(start, end))
...     print
...
If Harvard doesn't come through, I'll take the test to get into Yale.
many parents set goals for their children, or maybe they don't set a goal.
I just tried this example from the Grammarly post, and it works correctly. Because spaCy parses this as two clauses, it puts the sentence break in the correct place, even though "many" is lower-cased.
I haven't highlighted this yet because I still haven't sorted out better training and evaluation data. I need to train models on more web text, instead of the current model, which is trained on Wall Street Journal text.
All of this has been delayed by me dropping everything to do a demo for a major client. If I secure this, then I can say that spaCy's development is secure for the immediate future, and I'll even be able to hire additional help. But for the last month, things have been less smooth than I'd like. I hope to have all this sorted out soon.
Hmm. I see. Thanks for the response. I was curious because I'd love to contribute but still need to do a bit of reading on cython and going over this project more in depth.
Improved sentence segmentation is now included in the latest release. The docs are updated with usage examples.
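For anyone landing here later, the usage now looks roughly like this (a minimal sketch: doc.sents yields Span objects; the "en" shortcut name is just an assumption about which model you have installed):

```python
import spacy

# Minimal usage sketch for the newer API, where doc.sents yields Span objects.
# The "en" shortcut is an assumption; use whichever model/shortcut is installed.
nlp = spacy.load("en")
doc = nlp(u"Python packaging is awkward at the best of times. "
          u"So, please report issues as you encounter them.")
for sent in doc.sents:
    print(sent.text)
```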
Hey,
the sentence segmenter for French is very bad :(
It cuts sentences in the middle, right after an adjective, or doesn't follow basic rules (a period followed by \n, for example), even when I replace each \n with \n\n.
I'm using this kind of template:
nlp = spacy.load('fr')
USTR_mydoc = USTR_get_1string_from_file("/dev/stdin")
doc = nlp(USTR_mydoc)
print([(w.text, w.pos_) for w in doc])
(USTR_get_1string_from_file is just a function that returns a whole file as a string.)
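One possible workaround for the newline case (a sketch, assuming the spaCy 2.x pipeline API and that newline characters survive tokenization as whitespace tokens; the newline_boundaries helper is just a name I made up) is to force a sentence start after each newline before the parser runs:

```python
import spacy

def newline_boundaries(doc):
    # Force a sentence start on the token that follows a newline token.
    # Assumes the tokenizer keeps newlines as whitespace tokens.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("fr")
nlp.add_pipe(newline_boundaries, before="parser")

doc = nlp(u"Première phrase.\nDeuxième phrase.")
print([sent.text for sent in doc.sents])
```

As far as I know, the parser respects boundaries that have already been set, so this only adds splits rather than replacing the model's decisions.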
I have a similar issue with Spanish
text = """
En estadística, un error sistemático es aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud. Puede estar originado en un defecto del instrumento, en una particularidad del operador o del proceso de medición, etc. Se contrapone al concepto de error aleatorio.
"""
nlp = spacy.load("es")
doc = nlp(text)
for span in doc.sents:
print("#> span:", span)
This gives:
```
...
```
As you can see, the first "span" isn't a full sentence.
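Continuing from the snippet above, it can help to print each span's token offsets next to its text, to see exactly where the model puts the breaks (a small sketch, assuming the Span.start / Span.end attributes):

```python
# Debugging sketch: print each sentence span's token offsets next to its text,
# to see exactly where the model places the breaks.
for sent in doc.sents:
    print(sent.start, sent.end, repr(sent.text))
```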
The situation is a bit better in v2.0, but there are still plenty of errors. After a discussion with @honnibal on Twitter, it turned out the French model was trained on the shuffled version of the Sequoia treebank as provided by the UD people, so it's likely the segmentation learning process was messed up because of that. Also, it's a small treebank, so there aren't enough data points to learn segmentation accurately.
Is there any way we can train the sentence segmentation on custom data? If so, it would be great if someone could provide examples as well.
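Sentence boundaries come out of the dependency parse, so one route is to train or update the parser on examples annotated with heads and dependency labels. Below is a rough sketch against the spaCy 2.x training API, with a tiny made-up English example just to show the data shape (far too little data for a real model):

```python
import random
import spacy

# Tiny, made-up training example: heads are absolute token indices and each
# sentence has its own ROOT, which is how the parser learns where to split.
TRAIN_DATA = [
    ("I like cats. They purr.", {
        "heads": [1, 1, 1, 1, 5, 5, 5],
        "deps": ["nsubj", "ROOT", "dobj", "punct", "nsubj", "ROOT", "punct"],
    }),
]

nlp = spacy.blank("en")                 # start from a blank pipeline
parser = nlp.create_pipe("parser")      # sentence boundaries come from the parser
nlp.add_pipe(parser, first=True)
for _, annotations in TRAIN_DATA:
    for dep in annotations["deps"]:
        parser.add_label(dep)

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)

doc = nlp("I like cats. They purr.")
print([sent.text for sent in doc.sents])
```

For French you would swap in your own annotated sentences (e.g. converted from a UD treebank); to improve an existing pretrained model rather than a blank one, you would load it and update only the parser component.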
How exactly does the sentence separation work? Is it based on regular expressions (considering punctuation like full stops and question marks)?
Hey @syllog1sm, I am experimenting with your approach but I am getting this error:
ValueError Traceback (most recent call last)
<ipython-input-25-edf97381b54c> in <module>()
3 range1 = lambda start, end: range(start, end+1)
4
----> 5 for start, end in en_doc.sents:
6 print(''.join(tokens[i].string for i in range(start, end)))
ValueError: too many values to unpack (expected 2)
Do you have an idea why? I am no expert in Python.
I am using Python 3.6 and spaCy 2.0.5.
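In spaCy 2.x, doc.sents yields Span objects rather than (start, end) pairs, so unpacking a multi-token span into two variables is what triggers that ValueError. A minimal sketch of the updated idiom, reusing the en_doc from the traceback above:

```python
# doc.sents yields Span objects in spaCy 2.x; iterate them and use .text
# instead of unpacking (start, end) pairs and re-joining token strings.
for sent in en_doc.sents:
    print(sent.text)
```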
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.