spaCy: span.orth_ != span.text

Created on 19 Nov 2017 · 2 comments · Source: explosion/spaCy

According to the API docs, Span.orth_ should be identical to Span.text, but that's not what I'm seeing:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('The black cat purrs.')
>>> span = doc[1:3]
>>> span.text
'black cat'
>>> span.orth_
'blackcat'

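When I peek into the source code, I see the difference: the text property joins the span's constituent tokens via u''.join([t.text_with_ws for t in self]), which keeps each token's trailing whitespace, while orth_ uses ''.join([t.orth_ for t in self]).strip(), which does not. Since t.orth_ never carries whitespace, the spaces between tokens are lost and the final strip() has nothing left to remove.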
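To make the difference concrete without loading a model, here is a minimal sketch that mimics the two join strategies on plain Python objects. The Tok class and the three helper functions are hypothetical stand-ins, not spaCy's actual implementation; only the attribute names orth_, whitespace_ and text_with_ws mirror the real Token API. Trimming the final trailing space in span_text is an assumption inferred from the REPL output above, where span.text is 'black cat' with no trailing space.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy Token (not the real class)."""
    orth_: str        # verbatim token text, never includes whitespace
    whitespace_: str  # trailing whitespace after the token, "" if none

    @property
    def text_with_ws(self) -> str:
        return self.orth_ + self.whitespace_


def span_text(tokens):
    # Join tokens with their trailing whitespace, then drop the
    # whitespace after the last token (assumed from the REPL output).
    return "".join(t.text_with_ws for t in tokens).rstrip()


def span_orth_reported(tokens):
    # The join described in this issue: orth_ has no whitespace,
    # so the tokens run together.
    return "".join(t.orth_ for t in tokens).strip()


def span_orth_expected(tokens):
    # Behavior the API docs describe: identical to the span's text.
    return span_text(tokens)


tokens = [Tok("black", " "), Tok("cat", " ")]
print(repr(span_text(tokens)))           # 'black cat'
print(repr(span_orth_reported(tokens)))  # 'blackcat'
print(repr(span_orth_expected(tokens)))  # 'black cat'
```

The point of the sketch is only that stripping after the join cannot restore whitespace that was never included; any real fix would have to build orth_ from the whitespace-preserving token text.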

I could submit a PR for this (would you believe it, I've never submitted a PR here?!), but I don't know how you'd prefer the change to go.

Your Environment

  • spaCy version: 2.0.3
  • Platform: Darwin-16.7.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Models: de, de_core_news_sm, en, en_core_web_md, en_core_web_sm, en_md, es, es_core_news_sm, xx, xx_ent_wiki_sm
Labels: bug

All 2 comments

Thanks! Not sure how this happened, but yes, Span.orth_ should match Span.text and include whitespace.

"I could submit a PR for this (would you believe it, I've never submitted a PR here?!)"

Yes, even for that reason alone you definitely should 😜
