Hello spaCy team!
It appears that there isn't an option to determine whether any single token is part of a noun chunk (as determined from doc.noun_chunks), in the same way that token.ent_iob does for entities.
The main problem that I am trying to solve is merging noun_chunks in specific sentences.
Is this a feature that could be added? Or is there another solution?
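For concreteness, this is the token-level hook that already exists for entities, which I'd like an analogue of for noun chunks (a minimal sketch):

import spacy

nlp = spacy.load('en')
doc = nlp(u'Mary had a little lamb')
for token in doc:
    # token.ent_iob_ flags entity membership per token ("B"/"I"/"O");
    # there is no equivalent attribute for noun chunk membership.
    print(token.text, token.ent_iob_)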
Why not just:
noun_words = set(w.i for nc in doc.noun_chunks for w in nc)
for word in doc:
    if word.i in noun_words:
        print("Is noun", word.text)
Are noun chunks and entities not different though? Or have I misunderstood your solution?
For example:
In [38]: text = nlp(u'Mary had a little lamb')
In [39]: [nc for nc in text.noun_chunks]
Out[39]: [Mary, a little lamb]
In [40]: [ent for ent in text.ents]
Out[40]: [Mary]
Sorry, mistyped. I edited my snippet.
Of course, I see now. So the best solution to find noun chunks in a sentence would be:
noun_words = set(w.i for nc in doc.noun_chunks for w in nc)
sents = list(doc.sents)
noun_from_chunks_in_sentence = [w for w in sents[2] if w.i in noun_words]
It would be great to be able to reconstruct noun chunks from the tokens in a sentence. Is it possible to have adjacent noun chunks? In that case, this method would be ambiguous.
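To illustrate the ambiguity (a sketch; I'm assuming the parser analyses the two objects of a double-object sentence as separate chunks):

doc = nlp(u'She gave the boy the ball')
# Expected chunks: She, the boy, the ball. "the boy" and "the ball" are
# adjacent, so their token indices merge into one contiguous run below.
noun_words = set(w.i for nc in doc.noun_chunks for w in nc)
print(sorted(noun_words))  # a flat set of indices cannot recover the boundary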
Perhaps some more context will make my problem clearer:
I am taking spacy tokens and wrapping them in my own token object (without a reference to the document object). These tokens are stored together as sentences. Using the iob tags, I can choose to merge my token wrappers into entities without needing the doc object.
However, I cannot reconstruct or merge noun chunks from token level properties.
Your method could be used to identify words in a noun chunk, perhaps also including an index for the noun chunk. Then storing these in a dictionary such as: {token_index: noun_chunk_index}. Which could then be used to group tokens together into chunks.
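Something like this sketch is the bookkeeping I have in mind (the names here are mine, not spaCy's):

from collections import defaultdict

# Hypothetical mapping to store on my token wrappers: token index -> chunk index.
token_to_chunk = {}
for chunk_idx, nc in enumerate(doc.noun_chunks):
    for w in nc:
        token_to_chunk[w.i] = chunk_idx

# Later, without the Doc object, wrappers that share a chunk index can be
# regrouped into chunks; distinct indices also keep adjacent chunks apart.
chunks = defaultdict(list)
for i in sorted(token_to_chunk):
    chunks[token_to_chunk[i]].append(i)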
There's a token.doc attribute, which might help you. You could also pretend that a noun chunk is a special type of entity:
doc.ents = [('NP', e.start, e.end) for e in doc.noun_chunks]
This should _add_ these entities to the document, and should set the IOB appropriately. (Note that it doesn't replace the entities, which is what I would've guessed this call would do. I mean to fix this in some way in future, as I think it's currently confusing.)
Thanks for the great suggestion. From your description I wasn't sure if this is the behaviour that I should expect:
In [11]: doc = nlp(u'The Bank of England appointed Mark Carney to be its governor. A brave decision.')
In [12]: [ent for ent in doc.ents]
Out[12]: [The Bank of England, Mark Carney]
In [13]: [np for np in doc.noun_chunks]
Out[13]: [The Bank, England, Mark Carney, its governor]
In [14]: doc.ents = [(100, nc.start, nc.end) for nc in doc.noun_chunks]
In [15]: [ent for ent in doc.ents]
Out[15]: [The Bank, England, Mark Carney, its governor]
Hmm.
I forgot that a named entity can be a noun chunk — so you're going to clobber your entity labels here, but only sometimes. Not good.
You'll want to filter the noun_chunks such that the named entities are excluded. You could do this a bunch of ways. This seems simplest to me:
ents = set((ent.start, ent.end) for ent in doc.ents)
# Keep only the chunks that don't share a span with an existing entity,
# so the assignment below can't clobber any entity labels.
noun_chunks = [(100, nc.start, nc.end) for nc in doc.noun_chunks
               if (nc.start, nc.end) not in ents]
doc.ents = noun_chunks
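Once the chunks are folded into doc.ents like this, the token-level flag asked about at the top of the thread falls out of the existing IOB machinery (a minimal sketch):

for token in doc:
    # ent_iob_ is "B" at the start of a span, "I" inside one, "O" outside,
    # so noun chunk membership can now be read off each token directly.
    print(token.text, token.ent_iob_)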
Trying to tie noun_chunks to specific sentences is a feature I'm trying to build as well. I altered your code @owlas for those trying to run it live on a document.
noun_words = set(w.i for nc in doc.noun_chunks for w in nc)
for s in doc.sents:
    noun_from_chunks_in_sentence = [w for w in s if w.i in noun_words]
    print(s.text)
    print(noun_from_chunks_in_sentence)
I'm assuming this just needs to be changed to check that the noun chunk's leftmost and rightmost tokens fall inside the sentence's boundaries.
for s in doc.sents:
    print(s.text)
    for nc in doc.noun_chunks:
        if nc.start >= s.start and nc.end <= s.end:
            print("INSIDE: " + nc.text)
        else:
            print("OUT: " + nc.text)
            print(s.start, s.end)
            print(nc.start, nc.end)
    print("")
This is a tidy solution, I'll try this out.
Maybe we should have a span2doc function? I think this might take some pressure off the span objects.
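Roughly, it would build a standalone Doc from a Span's tokens, something like this sketch (the helper name and details are hypothetical):

from spacy.tokens import Doc

def span2doc(span):
    # Copy the Span's surface tokens into a fresh Doc. Annotations such as
    # tags, the parse, and entities are not carried over in this sketch.
    words = [t.text for t in span]
    spaces = [bool(t.whitespace_) for t in span]
    return Doc(span.doc.vocab, words=words, spaces=spaces)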