Spacy: Rethink storage of parse tree in C data

Created on 31 Mar 2018 · 4Comments · Source: explosion/spaCy

Goal: Make underlying tree storage slightly more efficient, less confusing.

There's a lot of ways to store a tree. The one in spaCy was what I came up with off the top of my head when building my previous parser, during my research. I think we can probably improve this, especially since when I designed this initially, the trees were constrained to have no crossing branches. This is no longer true.

Current solution

The underlying TokenC struct stores the parse tree like this:


cdef struct TokenC:
    int32_t head # Offset of head, i.e. if attached to word 2 previous, head is -2
    uint32_t l_kids # Count of left children
    uint32_t r_kids # Count of right children
    uint32_t l_edge # Leftmost point of *yield* -- i.e. leftmost descendent
    uint32_t r_edge # Rightmost point of yield -- i.e. rightmost descendent

So the current baseline uses 160 bits per node for the tree information. I understand there's a huge chapter in Knuth's TAOCP about approaches to this. (I should really buy that...)

Requirements

We need to support the following methods:

cdef const TokenC* get_head(const TokenC* child) nogil

cdef const TokenC* get_left_child(const TokenC* head, int i) nogil

cdef const TokenC* get_right_child(const TokenC* head, int i) nogil

cdef int get_left_edge(const TokenC* head) nogil

cdef int get_right_edge(const TokenC* head) nogil

cdef int count_left_children(const TokenC* head) nogil

cdef int count_right_children(const TokenC* head) nogil

cdef void add_arc(TokenC* tokens, int head, int child, uint64_t label) nogil

Terminology and example

I'm using child to mean a direct descendant, and subtree to mean all direct and indirect descendents. So, let's say we had a tree like this:

        0
       /    \
      1.0    1.1
    /           \
   1.0.0      1.1.0

(The leaves are ordered 1.0.0, 1.0, 0, 1.1, 1.1.0 in the string)

The children of 0 are 1.0 and 1.1.
The head of 1.0 is 0.
The head of 1.1 is 0.
The subtree of 0 is (1.0.0, 1.0, 0, 1.1, 1.1.0)
The left edge of 0 is 1.0.0
The right edge of 0 is 1.1.0
The left edge of 1.0 is 1.0.0
The left and right edges of 1.0.0 is 1.0.0

What's where

For now it's best to just work in a distinct module, and get tests passing. If we get a good result, the integration will happen in the spacy.structs module, and then spacy/tokens/token.pyx and spacy/tokens/doc.pyx. Later we can also update code in spacy/syntax/_state.pxd.

enhancement help wanted

Source

honnibal

Most helpful comment

@skrcode Here's one of those data structure/algorithms tasks I mentioned :)

honnibal on 31 Mar 2018

👍2

All 4 comments

@skrcode Here's one of those data structure/algorithms tasks I mentioned :)

honnibal on 31 Mar 2018

👍2

@honnibal 1. I did a check on the number of places l_kids and r_kids are being used. I don't see any reason why they are required. To check if the node is a leaf node, we could always make a check if the node is equal to its left and right edges. span checks if the node is a leaf node by looking at both the left and right child counts.

Token class uses TokenC struct. What are the pros and cons of wrapping a class around the struct node? Could we add the contents of TokenC to Token class itself?
I feel that the overall tree representation is quite efficient. TAOCP section 2.3 on Trees had the overall notion of storing nodes into contiguous memory. In addition, if the nodes are doubly linked, we could get an even faster runtime by storing left and right child offsets. This would also solve point 1 to check if the node is a leaf node.
What do you think about doing away with the offset and going with absolute values?
Won't having crossing branches change the existing oracle arc eager algorithm? Wouldn't more constructs be added to the proposed APIs such as spanning tree algorithms etc.?

skrcode on 6 Apr 2018

@skrcode

From view of efficiency, you're right. From view of software design, checking #of left children and #of right children provides a higher level of abstraction rather than referencing the data structure(in my humble opinion, of course :blush: ) An even better idea is to implement an is_leaf method in Token class directly. Then one can remove l_kids and r_kids fields.

@honnibal I really find the current data structure very elegant, one l_edge and one r_edge; only the offsets. Calculation of lefts and rights are then straightforward, just increment/decrement until the current token is met.

Problems

OK, with the cross edges I problems are

calculation of lefts and rights
subtree calculations

I made a small sketch of the situation with the example sentences from the German morphology page with the Link.

Proposed Solution (not really)

is a traditional tree. We replace l_edge and r_edge pair with array counterparts, instead of holding one offset each, we hold an array of children offsets.

from cymem.cymem cimport Pool

cdef struct TokenC:
    cdef Pool mem

    int32_t head # Offset of head
    uint32_t l_kids # Count of left children
    uint32_t r_kids # Count of right children

    uint32_t* l_children # array of leftmost children, for instance [9]
    uint32_t* r_children # array of rightmost children for instance [5,7]

Requires more space and more iteration, I don't think it's a good solution but I offered it as a baseline for new ideas.

My Opinion

I don't know the percentage of the languages that has cross-dependencies. It seems to me that it does not worth changing the current data structure. A non-tricky tree will be slower due to array iterations, also most probably a small percentage of the sentences will have cross edges. @honnibal what do you say?