According to the API docs, Span.orth_ should be identical to Span.text, but that's not what I'm seeing:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('The black cat purrs.')
>>> span = doc[1: 3]
>>> span.text
'black cat'
>>> span.orth_
'blackcat'
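When I peek into the source code, I see the difference: the text property joins the span's constituent tokens via u''.join([t.text_with_ws for t in self]), which keeps each token's trailing whitespace, while orth_ uses ''.join([t.orth_ for t in self]).strip(), which does not. Because each token's orth_ omits its trailing whitespace, the spaces between tokens are lost, and the final .strip() only trims the ends of the joined string.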
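To make the difference concrete without loading a model, here is a minimal sketch of the two joins. FakeToken is a hypothetical stand-in for a spaCy token, not the real class, and both helper functions are illustrative names of my own. The second helper is only one possible way to make orth_ agree with text, assuming the trailing whitespace of the last token should be dropped (as the 'black cat' output above suggests); it is not necessarily how the maintainers would fix it.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (illustration only)."""
    orth_: str        # verbatim token text, without trailing whitespace
    whitespace_: str  # trailing whitespace after the token, "" or " "

    @property
    def text_with_ws(self) -> str:
        return self.orth_ + self.whitespace_


def orth_join(tokens):
    # The join quoted above for orth_: inter-token whitespace is never added,
    # so strip() has nothing to restore between tokens.
    return ''.join(t.orth_ for t in tokens).strip()


def text_like_join(tokens):
    # One possible fix (an assumption): join text_with_ws, then trim the ends.
    return ''.join(t.text_with_ws for t in tokens).strip()


# Tokens for the span doc[1:3] of 'The black cat purrs.'
span_tokens = [FakeToken('black', ' '), FakeToken('cat', ' ')]

print(repr(orth_join(span_tokens)))       # 'blackcat'
print(repr(text_like_join(span_tokens)))  # 'black cat'
```

Joining on text_with_ws keeps the original spacing between tokens, whatever it was, instead of guessing a single space.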
I could submit a PR for this (would you believe it, I've never submitted a PR here?!), but I don't know how you'd prefer the change to go.
Thanks! Not sure how this happened, but yes, Span.orth_ should match Span.text and include whitespace.
> I could submit a PR for this (would you believe it, I've never submitted a PR here?!)
Yes, even for that reason alone you definitely should 😜