There are many ways to store a tree. The one in spaCy is what I came up with off the top of my head when building my previous parser, during my research. I think we can probably improve on it, especially since when I designed it the trees were constrained to have no crossing branches. That constraint no longer holds.
The underlying TokenC struct stores the parse tree like this:
    cdef struct TokenC:
        int32_t head      # Offset of head, i.e. if attached to the word 2 previous, head is -2
        uint32_t l_kids   # Count of left children
        uint32_t r_kids   # Count of right children
        uint32_t l_edge   # Leftmost point of the *yield*, i.e. leftmost descendant
        uint32_t r_edge   # Rightmost point of the yield, i.e. rightmost descendant
So the current baseline uses 160 bits per node for the tree information. I understand there's a huge chapter in Knuth's TAOCP about approaches to this. (I should really buy that...)
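As a quick sanity check on that figure (a sketch using Python's struct module rather than the actual Cython layout, and assuming no padding): one int32_t plus four uint32_t fields comes to 20 bytes, i.e. 160 bits.

```python
import struct

# One int32 (head) plus four uint32 (l_kids, r_kids, l_edge, r_edge).
# '=' uses standard sizes with no alignment padding.
TREE_FIELDS = "=i4I"
bits_per_token = struct.calcsize(TREE_FIELDS) * 8
print(bits_per_token)  # 160 bits of tree information per token
```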
We need to support the following methods:
cdef const TokenC* get_head(const TokenC* child) nogil
cdef const TokenC* get_left_child(const TokenC* head, int i) nogil
cdef const TokenC* get_right_child(const TokenC* head, int i) nogil
cdef int get_left_edge(const TokenC* head) nogil
cdef int get_right_edge(const TokenC* head) nogil
cdef int count_left_children(const TokenC* head) nogil
cdef int count_right_children(const TokenC* head) nogil
cdef void add_arc(TokenC* tokens, int head, int child, uint64_t label) nogil
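To pin down the intended semantics, here is a minimal pure-Python sketch of these lookups over a flat array of head offsets (the list-of-dicts representation and function names are mine, not spaCy's; it assumes a projective tree):

```python
# Each token stores only the offset to its head; children are found by
# scanning the sentence, as in the offset-based TokenC scheme.
def get_head(tokens, i):
    # tokens[i]["head"] is an offset, so the head's index is i + offset
    return i + tokens[i]["head"]

def left_children(tokens, i):
    # Direct children to the left of i, in surface order
    return [j for j in range(i) if get_head(tokens, j) == i]

def right_children(tokens, i):
    return [j for j in range(i + 1, len(tokens)) if get_head(tokens, j) == i]

def get_left_edge(tokens, i):
    # Leftmost token of the yield: recurse into the leftmost left child
    lefts = left_children(tokens, i)
    return get_left_edge(tokens, lefts[0]) if lefts else i

def get_right_edge(tokens, i):
    rights = right_children(tokens, i)
    return get_right_edge(tokens, rights[-1]) if rights else i

# A five-token sentence; token 2 is the root (head offset 0)
heads = [1, 1, 0, -1, -1]
tokens = [{"head": h} for h in heads]
```

For instance, `get_left_edge(tokens, 2)` walks 2 → 1 → 0 and returns 0, while `left_children(tokens, 2)` returns `[1]`.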
I'm using child to mean a direct descendant, and subtree to mean all direct and indirect descendants. So, let's say we had a tree like this:
             0
           /   \
        1.0     1.1
        /          \
    1.0.0         1.1.0
(The tokens occur in the string in the order 1.0.0, 1.0, 0, 1.1, 1.1.0)
The children of 0 are 1.0 and 1.1.
The head of 1.0 is 0.
The head of 1.1 is 0.
The subtree of 0 is (1.0.0, 1.0, 0, 1.1, 1.1.0)
The left edge of 0 is 1.0.0
The right edge of 0 is 1.1.0
The left edge of 1.0 is 1.0.0
The left and right edges of 1.0.0 are both 1.0.0 itself
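The example above can be encoded directly as head offsets and the properties checked mechanically. A hedged Python sketch (`heads` and `yield_of` are my names; the yield is computed by repeatedly attaching tokens whose head is already in the set):

```python
# Token order 1.0.0, 1.0, 0, 1.1, 1.1.0 -> indices 0..4.
# head is stored as an offset from the token's own index; the root gets 0.
heads = [+1, +1, 0, -1, -1]   # 1.0.0->1.0, 1.0->0, 0 is root, 1.1->0, 1.1.0->1.1

def yield_of(i):
    # The subtree (yield) of token i: i plus all direct and indirect
    # descendants, found by fixed-point iteration over the head offsets.
    out = {i}
    changed = True
    while changed:
        changed = False
        for j, off in enumerate(heads):
            if j != i and j + off in out and j not in out:
                out.add(j)
                changed = True
    return sorted(out)
```

With this encoding, `yield_of(2)` is `[0, 1, 2, 3, 4]`, so the left and right edges of token 0 (index 2) are the minimum and maximum of its yield, matching the edge values listed above.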
For now it's best to just work in a distinct module, and get tests passing. If we get a good result, the integration will happen in the spacy.structs module, and then spacy/tokens/token.pyx and spacy/tokens/doc.pyx. Later we can also update code in spacy/syntax/_state.pxd.
@skrcode Here's one of those data structure/algorithms tasks I mentioned :)
@honnibal 1. I checked the places where l_kids and r_kids are used, and I don't see any reason why they are required. To check whether a node is a leaf, we could simply check whether the node equals both its left and right edges. (Span currently checks whether a node is a leaf by looking at both the left and right child counts.)
@skrcode One could add an is_leaf method to the Token class directly; then the l_kids and r_kids fields could be removed. @honnibal I really find the current data structure very elegant: one l_edge and one r_edge, only offsets. Calculating the lefts and rights is then straightforward: just increment/decrement until the current token is met.
OK, with crossing edges there are problems. I made a small sketch of the situation using the example sentences from the German morphology page (see the link). The sketch is a traditional tree. We replace the l_edge and r_edge pair with array counterparts: instead of holding one offset each, we hold an array of child offsets.
    from cymem.cymem cimport Pool

    cdef struct TokenC:
        int32_t head          # Offset of head
        uint32_t l_kids       # Count of left children
        uint32_t r_kids       # Count of right children
        uint32_t* l_children  # Array of left-child offsets, for instance [9]
        uint32_t* r_children  # Array of right-child offsets, for instance [5, 7]

(The child arrays would be allocated from a cymem Pool; a Python object like Pool can't live inside the struct itself.)
This requires more space and more iteration. I don't think it's a good solution, but I offer it as a baseline for new ideas.
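For comparison, a hedged Python sketch of this array-of-children variant (the dict representation and function names are mine; for simplicity it stores absolute child indices rather than offsets). Child lookup becomes O(1) at the cost of extra storage and bookkeeping on every arc:

```python
def make_token():
    # Per-token storage for the proposed variant: explicit child arrays
    # instead of l_edge/r_edge.
    return {"head": 0, "l_children": [], "r_children": []}

def add_arc(tokens, head, child):
    tokens[child]["head"] = head - child          # head stored as an offset
    side = "l_children" if child < head else "r_children"
    tokens[head][side].append(child)
    tokens[head][side].sort()                     # keep surface order

def get_left_edge(tokens, i):
    # Leftmost descendant (projective case): recurse into the leftmost
    # left child, if any; otherwise the token is its own left edge.
    lefts = tokens[i]["l_children"]
    return get_left_edge(tokens, lefts[0]) if lefts else i

# Build the earlier five-token example: order 1.0.0, 1.0, 0, 1.1, 1.1.0
tokens = [make_token() for _ in range(5)]
add_arc(tokens, 1, 0)   # 1.0.0 attaches to 1.0
add_arc(tokens, 2, 1)   # 1.0 attaches to the root
add_arc(tokens, 2, 3)   # 1.1 attaches to the root
add_arc(tokens, 3, 4)   # 1.1.0 attaches to 1.1
```

Here `get_left_edge(tokens, 2)` walks 2 → 1 → 0 without scanning the sentence, which is the upside; the downside is that every token now carries two growable arrays.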
I don't know what percentage of languages have crossing dependencies. It seems to me that it's not worth changing the current data structure: a non-tricky tree will be slower due to the array iteration, and most probably only a small percentage of sentences have crossing edges. @honnibal, what do you say?