spaCy: span.orth_ != span.text

Created on 19 Nov 2017 · 2 Comments · Source: explosion/spaCy

According to the API docs, Span.orth_ should be identical to Span.text, but that's not what I'm seeing:

>>> import spacy

>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('The black cat purrs.')
>>> span = doc[1: 3]
>>> span.text
'black cat'
>>> span.orth_
'blackcat'

When I peek into the source code, I see the difference: The text property joins the span's constituent tokens via u''.join([t.text_with_ws for t in self]), which includes trailing whitespace, while orth_ uses ''.join([t.orth_ for t in self]).strip(), which does not.
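To make the difference concrete without loading a model, here is a minimal sketch of the two joins in plain Python. `FakeToken`, `join_with_ws`, and `join_orth` are hypothetical names invented for illustration and are not part of spaCy; the final `strip()` in `join_with_ws` is also an assumption, added only so the result matches the 'black cat' output shown above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token; not part of spaCy."""
    orth_: str        # verbatim token text, without whitespace
    whitespace_: str  # trailing whitespace after the token: " " or ""

    @property
    def text_with_ws(self) -> str:
        return self.orth_ + self.whitespace_


def join_with_ws(tokens: List[FakeToken]) -> str:
    # Mirrors the text-style join quoted above. The strip() is an assumption,
    # added so the result matches the 'black cat' seen in the REPL session.
    return "".join(t.text_with_ws for t in tokens).strip()


def join_orth(tokens: List[FakeToken]) -> str:
    # Mirrors the orth_-style join quoted above: inter-token whitespace
    # is never included, so the tokens run together.
    return "".join(t.orth_ for t in tokens).strip()


# Tokens corresponding to doc[1:3] of 'The black cat purrs.'
span_tokens = [FakeToken("black", " "), FakeToken("cat", " ")]

print(repr(join_with_ws(span_tokens)))  # 'black cat'
print(repr(join_orth(span_tokens)))     # 'blackcat'
```

If orth_ is meant to match text, one option might be to build it from the same whitespace-aware join, but I don't know which approach the maintainers would prefer.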

I could submit a PR for this — would you believe it, I've never submitted a PR here?! — but I don't know how you'd prefer the change to go.

Your Environment

spaCy version: 2.0.3
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.0
Models: de, de_core_news_sm, en, en_core_web_md, en_core_web_sm, en_md, es, es_core_news_sm, xx, xx_ent_wiki_sm

bug

bdewilde

All 2 comments

Thanks! Not sure how this happened, but yes, Span.orth_ should match Span.text and include whitespace.

could submit a PR for this — would you believe it, I've never submitted a PR here?!

Yes, even for that reason alone you definitely should 😜

ines on 20 Nov 2017

😄1 👍1

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock[bot] on 8 May 2018
