In the documents and tutorials, the main thing I haven't found is examples of how to break sentences down into small sub-thoughts/chunks. `noun_chunks` is handy, but some examples of using `token.head` to find small (near-complete) sentence chunks would be neat.
Let's take the example sentence on https://api.spacy.io/displacy/index.html
displaCy uses CSS and JavaScript to show you how computers understand language
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and JavaScript [to + show]
&
show you how computers understand [language]
I'm assuming that we can use `token.head` to build these groups. In one of your examples, you had the following function:
```python
def dependency_labels_to_root(token):
    '''Walk up the syntactic tree, collecting the arc labels.'''
    dep_labels = []
    while token.head is not token:
        dep_labels.append(token.dep_)
        token = token.head
    return dep_labels
```
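To see why the `while` loop terminates, note that spaCy marks the root of the parse by making it its own head. Here's a self-contained sketch of that walk using a hypothetical `Tok` stand-in class (not part of spaCy) with a hand-built parse of "displaCy uses CSS":

```python
class Tok:
    """Minimal stand-in for a spaCy Token: just .head and .dep_."""
    def __init__(self, text, dep_):
        self.text = text
        self.dep_ = dep_
        self.head = self  # a root points at itself, as in spaCy

def dependency_labels_to_root(token):
    '''Walk up the syntactic tree, collecting the arc labels.'''
    dep_labels = []
    while token.head is not token:  # stops at the root
        dep_labels.append(token.dep_)
        token = token.head
    return dep_labels

# Hand-built parse: 'uses' is the root; the others attach to it.
uses = Tok('uses', 'ROOT')
displacy = Tok('displaCy', 'nsubj'); displacy.head = uses
css = Tok('CSS', 'dobj'); css.head = uses

print(dependency_labels_to_root(css))   # ['dobj']
print(dependency_labels_to_root(uses))  # []
```

Starting from any token, the walk collects one label per arc on the path to the root, and the root itself yields an empty list.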
The easiest way is to find the head of the subtree you want, and then use the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` is the one that does what you're asking for most directly:
```python
from spacy.en import English

nlp = English()
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
    if word.dep_ in ('xcomp', 'ccomp'):
        print(''.join(w.text_with_ws for w in word.subtree))
```
It'd probably be better for `word.subtree` to return a `Span` object instead of a generator over the tokens. If you want the `Span`, you can get it via the `.right_edge` and `.left_edge` properties. The `Span` object is nice because you can easily get a vector, merge it, etc.
```python
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
    if word.dep_ in ('xcomp', 'ccomp'):
        subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
        print(subtree_span.text, '|', subtree_span.root.text)
        print(subtree_span.similarity(doc))
        print(subtree_span.similarity(subtree_span.root))
```
You might also want to select a head, and then select a start and end position by walking along its children. You could then take the .left_edge and .right_edge of those tokens, and use it to calculate a span.
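As a rough illustration of what `.left_edge` and `.right_edge` give you, here's how each token's subtree extent can be computed from head indices alone. This is plain Python over a hypothetical head-index array, not spaCy's implementation, and the parse for the fragment is hand-assigned:

```python
def subtree_edges(heads):
    """heads[i] is the index of token i's head; the root points at itself.
    Returns a (left_edge, right_edge) index pair for every token's subtree."""
    n = len(heads)
    left = list(range(n))
    right = list(range(n))
    # Propagate each token's position up to all of its ancestors.
    for i in range(n):
        j = i
        while heads[j] != j:
            j = heads[j]
            left[j] = min(left[j], i)
            right[j] = max(right[j], i)
    return list(zip(left, right))

# "displaCy uses CSS and JavaScript": 'uses' (index 1) is the root,
# 'and' and 'JavaScript' attach to 'CSS' (a hand-assigned parse).
words = ['displaCy', 'uses', 'CSS', 'and', 'JavaScript']
heads = [1, 1, 1, 2, 2]
edges = subtree_edges(heads)
start, end = edges[2]        # subtree headed by 'CSS'
print(words[start:end + 1])  # ['CSS', 'and', 'JavaScript']
```

Slicing the doc between a head's two edges (inclusive) is exactly what the `doc[word.left_edge.i : word.right_edge.i + 1]` expression above does.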
Thank you very much! This was exactly what I was attempting to find. I really appreciate the work you've put into spaCy. Also, thank you for your comments on various Hacker News submissions. It's very handy to have someone who knows this field explaining NLP best practices and current methods.
Any ideas how this could work in German? The `dep_` tag set is different and there are no "clauses" as far as I can see. I guess one could often split at a comma, but sometimes that wouldn't do it.