Sentence tokenization in spaCy?

Created on 9 Sep 2015 · 12 comments · Source: explosion/spaCy

While trying to do sentence tokenization in spaCy, I ran into the following problem:

from __future__ import unicode_literals, print_function
from spacy.en import English
nlp = English()
doc = nlp('Hello, world. Here are two sentences.')
sentence = doc.sents.next()

It was unclear to me how to get the text of the sentence object. I tried using dir() to find a method that would allow this and was unsuccessful. Any code that I have found from others trying to do sentence tokenization doesn't seem to function properly.
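
A side note for anyone reproducing this: doc.sents is a generator, so the .next() call on the last line of the snippet is Python 2 only. On Python 3 the same line would be written with the built-in next() (a minimal equivalent, assuming nothing else changes):

sentence = next(doc.sents)  # next() works on any generator in Python 3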


All 12 comments

The sentence is a Span object: http://spacy.io/docs/#api

The attributes on the Span object are a slightly rough part of the API at the moment. A Span.text attribute will be in upcoming versions.

For now, the best way is to do:

''.join(token.string for token in sentence)

The .string attribute includes trailing whitespace, so this will give you the original text of the sentence, verbatim.

There's also a sentence.string attribute. However, this also includes the trailing whitespace --- so you'll need to do sentence.string.strip() to get just the string.

If this all seems really weird: the over-arching idea is that you should usually only need the string for output. That's why the padding is there --- to make it easy to construct the original string from the tokens, so that mark-up can be attached to the original string.
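
To make that round trip concrete, here is a small sketch. It assumes a modern spaCy (2.x/3.x), where the whitespace-padded attribute is spelled token.text_with_ws rather than token.string:

import spacy

# a minimal sketch, assuming spaCy 2.x/3.x: the trailing-whitespace
# padding means joining the tokens reproduces the original string exactly
nlp = spacy.blank('en')
doc = nlp('Hello, world. Here are two sentences.')
assert ''.join(token.text_with_ws for token in doc) == doc.text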

The thing that I've tried to make really easy in the API is using the token objects. So, if your use case really needs the sentence string, and can't work with the tokens, I'd be interested to understand it a bit better.

My goal currently is to use spaCy to parse this:

raw_text = 'Hello, world. Here are two sentences.'

to this resulting output:

sentences = [ 'Hello, world.', 'Here are two sentences.' ]

Here is the code I ran:
from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc]

My result was:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']

Is there further work needed to find the resulting sentences?

Did you install the data?

python -m spacy.en.download all
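
(A note for readers on later releases: the spacy.en.download module is from the old 0.x API. Assuming an installed spaCy 2.x/3.x, the modern CLI equivalent would be:

python -m spacy download en_core_web_sm
)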

Tried reinstalling data, but the code is still returning the same result.

I'm also running into the same problem. Originally I had not downloaded the dataset, but downloading it didn't fix the problem.
It still responds with the same result:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']

Argh.

The last snippet I sent you was wrong --- sorry, it's late and I was hasty.

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

The Doc object has an attribute, sents, which gives you Span objects for the sentences.
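
For anyone landing here on a newer spaCy: the spacy.en.English entry point is also from the 0.x API, and the promised Span.text attribute did land. A rough modern equivalent (a sketch, assuming spaCy 3.x with the en_core_web_sm model installed) would be:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumes the model is installed
doc = nlp('Hello, world. Here are two sentences.')
sentences = [sent.text for sent in doc.sents]  # Span.text exists now
print(sentences)  # expected: ['Hello, world.', 'Here are two sentences.']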

Thank you! That worked.

Now it responds with this, as expected.

[u'Hello, world.', u'Here are two sentences.']

:+1: Thank you so much! It works!

I have the same problem with spaCy 2.0.7:

def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.string.strip() for sent in doc.sents]

sentence_tokenize("Navn den er kjøpt i er Sigrid Trasti")

['Navn den', 'er kjøpt', 'i er', 'Sigrid Trasti']
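
Those ragged splits look like what a statistical parser produces on a language it wasn't trained on: spaCy 2.x derives sentence boundaries from the dependency parse by default. One possible workaround is the rule-based sentencizer, which splits on sentence-final punctuation only; a minimal sketch, assuming spaCy 3.x and its blank Norwegian pipeline:

import spacy

# a blank Norwegian (Bokmål) pipeline plus the rule-based sentencizer,
# which splits on punctuation like '.', '!', '?' rather than the parse
nlp = spacy.blank('nb')
nlp.add_pipe('sentencizer')

doc = nlp('Navn den er kjøpt i er Sigrid Trasti')
print([sent.text for sent in doc.sents])
# the input has no sentence-final punctuation, so this should come back
# as a single sentence: ['Navn den er kjøpt i er Sigrid Trasti']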

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
