Can you provide an example of how to use spaCy for sentence segmentation? Thanks so much!
You could do something like this.
In [34]: text = u'''Python packaging is awkward at the best of times, and it's particularly tricky with C extensions, built via Cython, requiring large data files. So, please report issues as you encounter them, and bear with me :)'''
In [35]: tokens = nlp(text, parse=True)
In [36]: for s in tokens.sents:
   ....:     print ''.join(tokens[x].string for x in range(*s))
   ....:
Python packaging is awkward at the best of times, and it's particularly tricky with C extensions, built via Cython, requiring large data files.
So, please report issues as you encounter them, and bear with me :)
I think it might be a good idea to make sentence segmentation more visible, as, at first glance, people seem to assume that it might not be easy to do, or even possible.
e.g. http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html
@honnibal what would be the spaCy way of implementing a sentence splitter?
The example that Jim posted above is the current solution --- the tokens.sents attribute gives an iterator of (start, end) pairs. Unfortunately the Tokens.__getitem__ method doesn't accept a slice at the moment, so use range(start, end). This will be fixed in the next version.
The segmentation works a little differently from other libraries'. It uses the syntactic structure, not just the surface clues from the punctuation.
spaCy is unique among current parsers in parsing whole documents, instead of splitting first into sentences, and then parsing the resulting strings. This is possible because the algorithm is linear time, whereas a lot of previous parsers use polynomial time parsing algorithms.
This means the sentence boundary detection is often robust to difficult cases:
>>> tokens = nlp(u"If Harvard doesn't come through, I'll take the test to get into Yale. many parents set goals for their children, or maybe they don't set a goal.")
>>> for start, end in tokens.sents:
...     print ''.join(tokens[i].string for i in range(start, end))
...     print
...
If Harvard doesn't come through, I'll take the test to get into Yale.
many parents set goals for their children, or maybe they don't set a goal.
I just tried this example from the Grammarly post, and it works correctly. Because spaCy parses this as two clauses, it puts the sentence break in the correct place, even though "many" is lower-cased.
I haven't highlighted this yet because I still haven't sorted out better training and evaluation data. I need to train models on more web text, instead of the current model, which is trained on Wall Street Journal text.
All of this has been delayed by me dropping everything to do a demo for a major client. If I secure this, then I can say that spaCy's development is secure for the immediate future, and I'll even be able to hire additional help. But for the last month, things have been less smooth than I'd like. I hope to have all this sorted out soon.
Hmm. I see. Thanks for the response. I was curious because I'd love to contribute but still need to do a bit of reading on cython and going over this project more in depth.
Improved sentence segmentation is now included in the latest release. The docs are updated with usage examples.
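For anyone landing here later, the usage now looks roughly like this (a minimal sketch: doc.sents yields Span objects; the "en" shortcut name is just an assumption about which model you have installed):

```python
import spacy

# Minimal usage sketch for the newer API, where doc.sents yields Span objects.
# The "en" shortcut is an assumption; use whichever model/shortcut is installed.
nlp = spacy.load("en")
doc = nlp(u"Python packaging is awkward at the best of times. "
          u"So, please report issues as you encounter them.")
for sent in doc.sents:
    print(sent.text)
```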
Hey,
the sentence segmenter for French is very bad :(
It cuts sentences in the middle, right after an adjective, or doesn't follow basic rules (a period followed by \n, for example), even when I replace each \n with \n\n.
I'm using this kind of template:
nlp = spacy.load('fr')
USTR_mydoc = USTR_get_1string_from_file("/dev/stdin")
doc = nlp(USTR_mydoc)
print([(w.text, w.pos_) for w in doc])
(USTR_get_1string_from_file is just a function that returns a whole file as a string.)
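One possible workaround for the newline case (a sketch, assuming the spaCy 2.x pipeline API and that newline characters survive tokenization as whitespace tokens; the newline_boundaries helper is just a name I made up) is to force a sentence start after each newline before the parser runs:

```python
import spacy

def newline_boundaries(doc):
    # Force a sentence start on the token that follows a newline token.
    # Assumes the tokenizer keeps newlines as whitespace tokens.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("fr")
nlp.add_pipe(newline_boundaries, before="parser")

doc = nlp(u"Première phrase.\nDeuxième phrase.")
print([sent.text for sent in doc.sents])
```

As far as I know, the parser respects boundaries that have already been set, so this only adds splits rather than replacing the model's decisions.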
I have a similar issue with Spanish
text = """
En estadística, un error sistemático es aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud. Puede estar originado en un defecto del instrumento, en una particularidad del operador o del proceso de medición, etc. Se contrapone al concepto de error aleatorio.
"""
nlp = spacy.load("es")
doc = nlp(text)
for span in doc.sents:
print("#> span:", span)
This gives:
```
...
```
As you can see, the first "span" isn't a full sentence.
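Continuing from the snippet above, it can help to print each span's token offsets next to its text, to see exactly where the model puts the breaks (a small sketch, assuming the Span.start / Span.end attributes):

```python
# Debugging sketch: print each sentence span's token offsets next to its text,
# to see exactly where the model places the breaks.
for sent in doc.sents:
    print(sent.start, sent.end, repr(sent.text))
```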
The situation is a bit better in v2.0, but there are still plenty of errors. After a discussion with @honnibal on Twitter, it turned out the French model was trained on the shuffled version of the Sequoia treebank as provided by the UD people, so it's likely the segmentation learning process was messed up because of that. Also, it's a small treebank, so there aren't enough data points to learn segmentation accurately.
Is there any way we can train the sentence segmentation on custom data? If so, it would be great if someone could provide examples as well.
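Sentence boundaries come out of the dependency parse, so one route is to train or update the parser on examples annotated with heads and dependency labels. Below is a rough sketch against the spaCy 2.x training API, with a tiny made-up English example just to show the data shape (far too little data for a real model):

```python
import random
import spacy

# Tiny, made-up training example: heads are absolute token indices and each
# sentence has its own ROOT, which is how the parser learns where to split.
TRAIN_DATA = [
    ("I like cats. They purr.", {
        "heads": [1, 1, 1, 1, 5, 5, 5],
        "deps": ["nsubj", "ROOT", "dobj", "punct", "nsubj", "ROOT", "punct"],
    }),
]

nlp = spacy.blank("en")                 # start from a blank pipeline
parser = nlp.create_pipe("parser")      # sentence boundaries come from the parser
nlp.add_pipe(parser, first=True)
for _, annotations in TRAIN_DATA:
    for dep in annotations["deps"]:
        parser.add_label(dep)

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)

doc = nlp("I like cats. They purr.")
print([sent.text for sent in doc.sents])
```

For French you would swap in your own annotated sentences (e.g. converted from a UD treebank); to improve an existing pretrained model rather than a blank one, you would load it and update only the parser component.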
How exactly does the sentence separation work? Is it based on regular expressions (considering punctuation like full stops and question marks)?
Hey @syllog1sm, I am experimenting with your approach but I am getting this error:
ValueError Traceback (most recent call last)
<ipython-input-25-edf97381b54c> in <module>()
3 range1 = lambda start, end: range(start, end+1)
4
----> 5 for start, end in en_doc.sents:
6 print(''.join(tokens[i].string for i in range(start, end)))
ValueError: too many values to unpack (expected 2)
Do you have an idea why? I am no expert in Python.
I am using Python 3.6 and spaCy 2.0.5.
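In spaCy 2.x, doc.sents yields Span objects rather than (start, end) pairs, so unpacking a multi-token span into two variables is what triggers that ValueError. A minimal sketch of the updated idiom, reusing the en_doc from the traceback above:

```python
# doc.sents yields Span objects in spaCy 2.x; iterate them and use .text
# instead of unpacking (start, end) pairs and re-joining token strings.
for sent in en_doc.sents:
    print(sent.text)
```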
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.