Sentence tokenization in spaCy?

Created on 9 Sep 2015 · 12 comments · Source: explosion/spaCy

While trying to do sentence tokenization in spaCy, I ran into the following problem:

from __future__ import unicode_literals, print_function
from spacy.en import English
nlp = English()
doc = nlp('Hello, world. Here are two sentences.')
sentence = doc.sents.next()

It was unclear to me how to get the text of the sentence object. I tried using dir() to find a method that would allow this and was unsuccessful. Any code that I have found from others trying to do sentence tokenization doesn't seem to function properly.
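
A side note for anyone reproducing this: doc.sents is a generator, so the .next() call on the last line of the snippet is Python 2 only. On Python 3 the same line would be written with the built-in next() (a minimal equivalent, assuming nothing else changes):

sentence = next(doc.sents)  # next() works on any generator in Python 3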


All 12 comments

The sentence is a Span object: http://spacy.io/docs/#api

The attributes on the Span object are a slightly rough part of the API at the moment. A Span.text attribute will be in upcoming versions.

For now, the best way is to do:

''.join(token.string for token in sentence)

The .string attribute includes trailing whitespace, so this will give you the original text of the sentence, verbatim.

There's also a sentence.string attribute. However, this also includes the trailing whitespace --- so you'll need to do sentence.string.strip() to get just the string.

If this all seems really weird: the over-arching idea is that you should usually only need the string for output. That's why the padding is there --- to make it easy to construct the original string from the tokens, so that mark-up can be attached to the original string.
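
To make that round trip concrete, here is a small sketch. It assumes a modern spaCy (2.x/3.x), where the whitespace-padded attribute is spelled token.text_with_ws rather than token.string:

import spacy

# a minimal sketch, assuming spaCy 2.x/3.x: the trailing-whitespace
# padding means joining the tokens reproduces the original string exactly
nlp = spacy.blank('en')
doc = nlp('Hello, world. Here are two sentences.')
assert ''.join(token.text_with_ws for token in doc) == doc.text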

The thing that I've tried to make really easy in the API is using the token objects. So, if your use case really needs the sentence string, and can't work with the tokens, I'd be interested to understand it a bit better.

My goal currently is to use spaCy to parse this:

raw_text = 'Hello, world. Here are two sentences.'

to this resulting output:

sentences = [ 'Hello, world.', 'Here are two sentences.' ]

Here is the code I ran:
from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc]

My result was:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']

Is there further work needed to find the resulting sentences?

Did you install the data?

python -m spacy.en.download all
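
(A note for readers on later releases: the spacy.en.download module is from the old 0.x API. Assuming an installed spaCy 2.x/3.x, the modern CLI equivalent would be:

python -m spacy download en_core_web_sm
)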

Tried reinstalling data, but the code is still returning the same result.

I'm also running into the same problem. Originally I had not downloaded the dataset, but downloading it didn't fix the problem.
It still responds with the same result:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']

Argh.

The last snippet I sent you was wrong --- sorry, it's late and I was hasty.

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

The Doc object has an attribute, sents, which gives you Span objects for the sentences.
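
For anyone landing here on a newer spaCy: the spacy.en.English entry point is also from the 0.x API, and the promised Span.text attribute did land. A rough modern equivalent (a sketch, assuming spaCy 3.x with the en_core_web_sm model installed) would be:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumes the model is installed
doc = nlp('Hello, world. Here are two sentences.')
sentences = [sent.text for sent in doc.sents]  # Span.text exists now
print(sentences)  # expected: ['Hello, world.', 'Here are two sentences.']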

Thank you! That worked.

Now it responds with this, as expected.

[u'Hello, world.', u'Here are two sentences.']

:+1: Thank you so much! It works!

I have the same problem with spaCy 2.0.7:

def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.string.strip() for sent in doc.sents]

sentence_tokenize("Navn den er kjøpt i er Sigrid Trasti")

['Navn den', 'er kjøpt', 'i er', 'Sigrid Trasti']
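
Those ragged splits look like what a statistical parser produces on a language it wasn't trained on: spaCy 2.x derives sentence boundaries from the dependency parse by default. One possible workaround is the rule-based sentencizer, which splits on sentence-final punctuation only; a minimal sketch, assuming spaCy 3.x and its blank Norwegian pipeline:

import spacy

# a blank Norwegian (Bokmål) pipeline plus the rule-based sentencizer,
# which splits on punctuation like '.', '!', '?' rather than the parse
nlp = spacy.blank('nb')
nlp.add_pipe('sentencizer')

doc = nlp('Navn den er kjøpt i er Sigrid Trasti')
print([sent.text for sent in doc.sents])
# the input has no sentence-final punctuation, so this should come back
# as a single sentence: ['Navn den er kjøpt i er Sigrid Trasti']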

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
