spaCy: Noun chunk info from token

Created on 28 Jul 2016 · 13 comments · Source: explosion/spaCy

Hello spaCy team!

It appears that there isn't an option to determine whether any single token is part of a noun chunk (as determined from doc.noun_chunks), in the same way as token.ent_iob.

The main problem that I am trying to solve is merging noun_chunks in specific sentences.

Is this a feature that could be added? Or is there another solution?

Most helpful comment

Maybe we should have a span2doc function? I think this might take some pressure off the span objects.

All 13 comments

Why not just:

noun_words = set(w.i for nc in doc.noun_chunks for w in nc)
for word in doc:
    if word.i in noun_words:
        print("Is noun", word.text)

Are noun chunks and entities not different though? Or have I misunderstood your solution?

For example:

In [38]: text = nlp(u'Mary had a little lamb')

In [39]: [nc for nc in text.noun_chunks]
Out[39]: [Mary, a little lamb]

In [40]: [ent for ent in text.ents]
Out[40]: [Mary]

Sorry, mistyped. I edited my snippet.

Of course, I see now. So the best solution to find noun chunks in a sentence would be:

noun_words = set(w.i for nc in doc.noun_chunks for w in nc)
sents = list(doc.sents)
noun_from_chunks_in_sentence = [w for w in sents[2] if w.i in noun_words]

It would be great to be able to reconstruct noun chunks from the tokens in a sentence. Is it possible for two noun chunks to be adjacent? In that case, this method would be ambiguous.
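The adjacency concern can be illustrated with plain Python and made-up token indices (no spaCy involved): if chunk membership is stored only as a flat set of token indices, two adjacent chunks collapse into a single run and the boundary between them is lost.

```python
# Sketch: token indices {1, 2} and {3, 4} come from two adjacent noun chunks,
# but a flat membership set cannot keep them apart.
noun_words = {1, 2, 3, 4}

def runs(indices):
    """Group a set of token indices into maximal consecutive runs."""
    result, current = [], []
    for i in sorted(indices):
        if current and i != current[-1] + 1:
            result.append(current)
            current = []
        current.append(i)
    if current:
        result.append(current)
    return result

print(runs(noun_words))  # [[1, 2, 3, 4]] -- one run; the chunk boundary is gone
```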

Perhaps some more context will make my problem clearer:

I am taking spacy tokens and wrapping them in my own token object (without a reference to the document object). These tokens are stored together as sentences. Using the iob tags, I can choose to merge these entities later on, without needing the doc object.

However, I cannot reconstruct or merge noun chunks from token level properties.
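The IOB-driven entity merge described above can be sketched without any doc object; the token texts and B/I/O tags below are hypothetical stand-ins for the wrapper objects, not spaCy output.

```python
# Sketch (plain Python): merge tokens into entities using only their IOB tags,
# mirroring how token.ent_iob lets you merge without keeping the Doc around.
tokens = [("Mary", "B"), ("had", "O"), ("a", "O"), ("little", "O"), ("lamb", "O")]

def merge_entities(tagged):
    """Collapse each B tag and its following I continuations into one entity."""
    merged, current = [], None
    for text, iob in tagged:
        if iob == "B":
            if current is not None:
                merged.append(current)
            current = [text]
        elif iob == "I" and current is not None:
            current.append(text)
        else:  # an "O" tag closes any open entity
            if current is not None:
                merged.append(current)
                current = None
    if current is not None:
        merged.append(current)
    return [" ".join(parts) for parts in merged]

print(merge_entities(tokens))  # ['Mary']
```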

Your method could be used to make a dictionary perhaps:

noun_chunk_dict = {word_in_noun_chunk: root_of_noun_chunk, ... }

then all words with the same root of the noun chunk could be merged.
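That root-keyed dictionary can be sketched with plain data (the indices and words below are made up to mimic "Mary had a little lamb", not real spaCy output): tokens that share a root index are grouped back into one chunk.

```python
# Sketch (plain Python): noun_chunk_dict maps each token index inside a chunk
# to the index of that chunk's root token; grouping by root rebuilds the chunks.
from collections import defaultdict

# Hypothetical mapping: "Mary" (0) is its own root; "a little lamb" (2, 3, 4)
# roots at "lamb" (4). "had" (1) is in no chunk, so it has no entry.
noun_chunk_dict = {0: 0, 2: 4, 3: 4, 4: 4}
words = ["Mary", "had", "a", "little", "lamb"]

def chunks_by_root(mapping, words):
    """Merge all words that share the same chunk-root index."""
    groups = defaultdict(list)
    for token_i in sorted(mapping):
        groups[mapping[token_i]].append(words[token_i])
    return [" ".join(g) for _, g in sorted(groups.items())]

print(chunks_by_root(noun_chunk_dict, words))  # ['Mary', 'a little lamb']
```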


Alternatively, your method could identify the words in each noun chunk along with an index for that chunk, stored in a dictionary such as {token_index: noun_chunk_index}, which could then be used to group tokens back into chunks.

There's a token.doc attribute, which might help you. You could also pretend that a noun chunk is a special type of entity:

doc.ents = [('NP', e.start, e.end) for e in doc.noun_chunks]

This should _add_ these entities to the document, and should set the IOB appropriately. (Note that it doesn't replace the entities, which is what I would've guessed this call would do. I mean to fix this in some way in future, as I think it's currently confusing.)

Thanks for the great suggestion. From your description I wasn't sure if this is the behaviour that I should expect:

In [11]: doc = nlp(u'The Bank of England appointed Mark Carney to be its governor. A brave decision.')

In [12]: [ent for ent in doc.ents]
Out[12]: [The Bank of England, Mark Carney]

In [13]: [np for np in doc.noun_chunks]
Out[13]: [The Bank, England, Mark Carney, its governor]

In [14]: doc.ents = [(100, nc.start, nc.end) for nc in doc.noun_chunks]

In [15]: [ent for ent in doc.ents]
Out[15]: [The Bank, England, Mark Carney, its governor]

Hmm.

I forgot that a named entity can be a noun chunk — so you're going to clobber your entity labels here, but only sometimes. Not good.

You'll want to filter the noun_chunks such that the named entities are excluded. You could do this a bunch of ways. This seems simplest to me:

ents = set((ent.start, ent.end) for ent in doc.ents)
noun_chunks = [(100, nc.start, nc.end) for nc in doc.noun_chunks if (nc.start, nc.end) not in ents]
doc.ents = noun_chunks
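One caveat worth noting: the exact (start, end) match above only drops chunks whose boundaries coincide with an entity, but in the example earlier The Bank and England overlap The Bank of England without matching its boundaries. An interval-overlap check is stricter; here is a sketch using plain (start, end) tuples in place of spaCy spans, with token offsets hand-derived from the example sentence.

```python
# Sketch (plain Python): keep only chunks whose token range does not overlap
# any entity range. Offsets mirror "The Bank of England appointed Mark Carney
# to be its governor.":
ents = [(0, 4), (5, 7)]                     # The Bank of England, Mark Carney
chunks = [(0, 2), (3, 4), (5, 7), (9, 11)]  # The Bank, England, Mark Carney, its governor

def overlaps(a, b):
    """True if the half-open ranges [a0, a1) and [b0, b1) share any token."""
    return a[0] < b[1] and b[0] < a[1]

kept = [c for c in chunks if not any(overlaps(c, e) for e in ents)]
print(kept)  # [(9, 11)] -- only "its governor" survives
```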

Trying to tie noun_chunks to specific sentences is a feature I'm trying to build as well. I altered your code @owlas for those trying to run it live on a document.

noun_words = set(w.i for nc in doc.noun_chunks for w in nc)
for s in doc.sents:
    noun_from_chunks_in_sentence = [w for w in s if w.i in noun_words]
    print(s.text)
    print(noun_from_chunks_in_sentence)

I'm assuming this just needs a check that the noun chunk's leftmost and rightmost tokens fall within the sentence's leftmost and rightmost tokens.

for s in doc.sents:
    print(s.text)
    for nc in doc.noun_chunks:
        if nc.start >= s.start and nc.end <= s.end:
            print("INSIDE: " + nc.text)
        else:
            print("OUT: " + nc.text)
        print(s.start, s.end)
        print(nc.start, nc.end)
        print("")

This is a tidy solution, I'll try this out.

Maybe we should have a span2doc function? I think this might take some pressure off the span objects.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
