Spacy: Slow entity merging on large texts

Created on 23 Nov 2016 · 15 comments · Source: explosion/spaCy

First, I'd like to say that I'm really impressed with what you guys have been doing with spaCy, and the 1.x releases look great.

My issue is that when I do entity merging on a large document (novel length) it takes an unbelievably long time (as much as 10 minutes).

This takes about 12 seconds on four novels:

def tokenize(s):
    return [tok.lemma_ for tok in nlp(s, parse=False) if not tok.is_space]

But this takes up to 50 minutes on the same inputs:

def tokenize_merge(s):
    doc = nlp(s, parse=False)
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.label_)
    return [tok.lemma_ for tok in doc if not tok.is_space]

(It actually took 24 minutes on a system running 0.101 and 51 minutes on a comparable machine running 1.2.0, but I only ran it a couple of times on each machine, so that's not really conclusive.)

The second function is adapted from the old documentation and the sense2vec code.

My theory, based on documentation that says that token indices are changed when merging occurs, is that merge is O(n) on the length of the document. Rather than try and dive into the code I thought I'd just ask, what is going on here?

Is there a better way to do this than what I'm doing? It looks like some of the new callback functionality may be able to perform the merging for me? What if I can guarantee that the document will never be sliced into, it will only ever be iterated over from the beginning, so it doesn't matter if the indices are messed up?

I love the ability to merge tokens but unless there's a faster way to do it (I really hope I'm just doing it wrong), I don't think it's usable on input of the size I'm working with.

Details

  • Combined length of the input docs ~829000 tokens
  • Python 2.7.6
  • Ubuntu 14.04 (yes, I know it's time to upgrade, I'm working on it)

Side notes:

It's even worse when I do NPs too (I assume because there are more of them), but I don't have the times because I didn't let it finish.
If I'm right, and this is O(n), you could get a tiny decrease in constant factors by starting from the end.

Labels: feat / ner · help wanted · help wanted (easy) · perf / speed

All 15 comments

I bet there's some sort of "Shlemiel the painter" behaviour here. It's an easy mistake to make because the children aren't stored explicitly (as that would require N**2 space).

If you have time, could you have a look at the implementation in doc.merge() and see whether you can figure out what might be worse than O(n)? It would be something in the way the heads are recalculated when they get translated.

Come to think of it: even if the merge implementation is O(n), the number of entities will also be O(n) --- so merging all the entities will be O(n**2). In other words, we probably need a bulk merge function.
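
Schematically, using the calls from the original report, the cost looks like this (pseudocode, not spaCy internals):

for ent in doc.ents:                               # up to O(n) entities in an n-token doc
    ent.merge(ent.root.tag_, ent.text, ent.label_)
    # ^ each call re-walks the doc's token array to shift indices/heads: O(n)
# total work: O(n) merges x O(n) reindexing each = O(n**2)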

I'll take a look at the implementation. And while in the worst case the number of entities would indeed be the same as the number of tokens, in practice (and I think we can make an even stronger statement, since we're dealing with natural language) the number of entities is much smaller. In the case of these books it was less than 10,000 entities for each (which actually seems small now that I think about it). In terms of Big-O, though, it would be O(n^2) if you're doing an in-place shift of an array, which sounds pretty bad.

I don't really know how the tokenizer works, but is there any chance that merging (or matching might be the appropriate API here) could run at tokenization time, i.e. before the doc is finished? Parsing and entity detection don't go across sentence boundaries, right? And sentence boundaries are determined by the parser anyway, so maybe there could be a merge/match phase at the end of each sentence.

Is this behavior typical though? I'm not doing something wrong with the code? If that's the case, maybe I'll have to come up with some alternate approach. I don't really have the chapter break information for these texts; I could try a crude regex, but that would be a bit of a pain, or I could just pick arbitrary page breaks, or maybe parallelize the merging using asynchronous threads or processes. In any case, I'll see what I can see in the implementation.

Is this, and all of the for loops after it, going over the current entity span or the entire doc?
https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L714
Looks like it goes over the whole doc.

That would definitely be O(n) for each merge operation, so looks like we're O(n^2) overall. Could reduce the constant factors by only looping once maybe. I think this needs a different approach though, like the bulk merge or changing the time at which these operations take place so that merging only occurs when spans are near the end of the document and only a small number of indices need to be moved.

That loops over the whole doc, yes. But it also does almost nothing — it should be just a C loop and incrementing a struct member.

Another suggestion: check the .is_parsed flag, and have a faster merge when no heads need to be adjusted?
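
i.e. something along these lines (a rough sketch with hypothetical helper names, not the actual doc.pyx code):

def merge(doc, start, end, tag, lemma, ent_type):
    if not doc.is_parsed:
        # No dependency heads to translate: just rewrite the merged token's
        # attributes and compact the token array.
        return _merge_without_head_fixup(doc, start, end, tag, lemma, ent_type)
    # Parsed doc: also re-point every head that referred to the merged region.
    return _merge_with_head_fixup(doc, start, end, tag, lemma, ent_type)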

I've thought about this a bit more and I think you're right that a bulk merge operation is the way to go. Even if those loops do almost nothing, merging all entities, or NPs, or anything else for that matter, in a Doc is O(n^2) on the length of the doc and as I'm working on docs of length ~10^5 that's still a lot of operations.

I didn't notice anything that should make each merge call any worse than O(n), but one other thing I did notice is that the very last operation, clearing the cached Python objects, allocates a length-n array of None for every merge. I have no idea how this works once it's compiled in Cython, but my inclination is to say that allocating a big array of Python objects on every merge call isn't great for performance either, even though the * operation for allocating the array is pretty fast. Not sure there's anything that can be done about this step, but I noticed it, so I thought I'd bring it up.

As for your most recent suggestion, it sounds like you're basically looking to add a faster code path for operations that don't require compressing the tokens array or adjusting the head pointers. I think that's a great suggestion and a good optimization, but it still leaves the main bottleneck unaddressed. We have an O(n^2) operation and it's slow. I think that creating a bulk merge operation, where all of the structs are adjusted and the tokens are compressed at once, is the way to go. A bulk merge gets us to O(n) and that would fix the bottleneck. You'd create holes in the array, keep track of the hole indices, and then loop from start to end and compress the holes all in one go. You could even do some of the other struct updates and head pointer adjustments in the same loop, I think, though that's just a constant-factor improvement. The only big concern I see is if you have overlaps of some kind and you end up having to do a bunch of offset checks to make sure you don't end up with things pointing to the wrong place after the holes are compressed.

What do you think the best way to implement a bulk merge in the API would be? I still haven't actually used the merger API but that seems like it's doing a similar thing. Maybe that could be leveraged?

+1 — we need all of the above, but the most important is bulk merge.

It would be nice to switch doc.merge() to be the bulk merge operation. The single merge use-case is easily handled by Span.merge(). We can overload the API to start with, so that the deprecated usage is supported and warned, and then drop support for single-entity use via doc.merge().

This would give us:

Span.merge(): Calls Doc.merge(), with itself as the single case.
Doc.merge(): Does the actual merging, of N spans.
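
Roughly, the shape I have in mind would be something like this (illustration of the proposed API only, not the shipped implementation):

class Span:
    def merge(self, *attrs, **attributes):
        # The single-span case just delegates to the bulk operation on its Doc.
        return self.doc.merge([self], *attrs, **attributes)

class Doc:
    def merge(self, spans, *attrs, **attributes):
        # Bulk path: build the offset table once, then compress the token
        # array and re-point heads in a single pass over the document.
        raise NotImplementedError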

Can confirm. I've been using PhraseMatcher with about ~2M entities on documents of up to five sentences, and the slowest part is the merge function, which takes 1 ms per call according to cProfile. It's super fast if I disable merging here and do it manually (while working with the docs as strings).

@honnibal I think the API you proposed makes sense. Can't wait for v2 :P

@sadovnychyi Can you explain what you did to merge entities manually as strings?

@ELind77 Not sure that this would be helpful for you, but it could be. My task is simply to replace spaces with underscores in specific phrases:

import spacy
nlp = spacy.load('en')
doc = nlp('Bla bla. Bla hello world. qwe.')
matcher = spacy.matcher.PhraseMatcher(
  nlp.vocab, [spacy.tokens.Doc(nlp.vocab, words=['hello', 'world'])])
matches = matcher(doc)
print(matches)
# [(13, 24, 'UH', 'hello world', 'MWE')]
matches_dict = {m[0]: m[1] for m in matches}
for sent in doc.sents:
  s = []
  merge_until = None
  for token in sent:
    match = matches_dict.get(token.idx)
    if match is not None:
      merge_until = match
      s.append(token.text.lower())
    elif merge_until is not None:
      s[-1] += ' %s' % token.text
      if (len(token.text) + token.idx) >= merge_until:
        merge_until = None
        s[-1] = s[-1].replace(' ', '_')
    else:
      s.append(token.text.lower())
  print(s)
# ['bla', 'bla', '.']
# ['bla', 'hello_world', '.']
# ['qwe', '.']

This is significantly faster than using doc merge function, but it only gives you plain strings.

@sadovnychyi thank you for sharing! I ended up doing this for the time being:

def yield_merged_ents(doc, attr='lemma_'):
    ent = ''
    for t in doc:
        # If we're inside an entity, append and continue
        if t.ent_iob_ == 'I':
            ent += '_%s' % getattr(t, attr)
            continue
        # If the current entity has ended, yield it and clear
        if ent:
            yield ent
            ent = ''
        # start a new entity if needed
        if t.ent_iob_ == 'B':
            ent += getattr(t, attr)
            continue
        yield getattr(t, attr)
    # Clean up
    if ent:
        yield ent
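
For reference, usage looks roughly like this (assuming `nlp` is a loaded spaCy model with NER enabled; the exact strings depend on the lemmatizer and the entities the model finds):

doc = nlp(u"Barack Obama visited New York last week.")
print(list(yield_merged_ents(doc)))
# Multi-token entities come out as single underscore-joined items,
# e.g. something like 'barack_obama' and 'new_york', without mutating the Doc.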

-- Eric

It would be nice to get this resolved if someone has time — I think it should be a pretty nice self-contained task.

There's surely a way to do this in one pass, but it's late and I can't immediately see it. It might not even be faster, anyway. Here's a simple algorithm that does two passes and allocates O(n) space:

def bulk_merge(self, spans):
    # 1. Track offset between before and after index
    offsets = [0] * len(self)
    offset = 0
    for start, end in spans:
        for i in range(last_end, start):
            offsets[i] = offset
        offsets[start] = offset
        for i in range(start+1, end):
            offset += 1
            # 2. Use the same array to mark deletions
            offsets[i] = -1
        last_end = start
    # 3. Put all the tokens where they should be, and adjust the heads and IDs
    new_tokens = [None] * len(self)
    for i, token in enumerate(tokens):
        if offsets[i] != -1:
            token.i -= offsets[i]
            token.head -= offsets[i]
            new_tokens[i - offsets[i]] = token

Again...I haven't really thought this through. I think this should work though?

Hi @honnibal, I wanna take a shot at this, how can I start?

@ibrahimsharaf I'm not sure how to give a useful answer outside of the information already in the thread.

Maybe it would be useful to sketch out the algorithm in pure Python. Imagine you've got a list of elements like (index, text, head_offset), where index needs to be the position of the element within the list, and head needs to indicate another token. You can make the tokens objects if that's easier for you.

We want a bulk merge operation that doesn't make a nested loop over the tokens, i.e. that runs in linear time with respect to the number of words.
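
For concreteness, a rough pure-Python version of that exercise could look something like this (toy token dicts rather than spaCy structures; it assumes non-overlapping spans and merges each span onto its first token):

def bulk_merge(tokens, spans):
    """Collapse each (start, end) span into its first token in linear time.

    `tokens` is a list of dicts with 'i', 'text', and 'head' (an absolute
    token index); `spans` is a list of non-overlapping (start, end) offsets.
    """
    n = len(tokens)
    # offsets[i]: how far token i shifts left; -1 marks a token that is
    # swallowed by the span to its left.
    offsets = [0] * n
    offset = 0
    last_end = 0
    for start, end in sorted(spans):
        for i in range(last_end, start + 1):
            offsets[i] = offset
        for i in range(start + 1, end):
            offset += 1
            offsets[i] = -1
        last_end = end
    for i in range(last_end, n):
        offsets[i] = offset

    # Second pass: move surviving tokens and remap heads through the same table.
    merged = []
    for i, tok in enumerate(tokens):
        if offsets[i] == -1:
            # Fold the swallowed token's text into its span's surviving token.
            merged[-1]['text'] += '_' + tok['text']
            continue
        head = tok['head']
        while offsets[head] == -1:   # head was merged away: walk back to the
            head -= 1                # span's surviving (first) token
        merged.append(dict(tok, i=i - offsets[i], head=head - offsets[head]))
    return merged

# Example:
#   toks = [{'i': 0, 'text': 'New',   'head': 1},
#           {'i': 1, 'text': 'York',  'head': 2},
#           {'i': 2, 'text': 'rocks', 'head': 2}]
#   bulk_merge(toks, [(0, 2)])
#   -> [{'i': 0, 'text': 'New_York', 'head': 0},
#       {'i': 1, 'text': 'rocks',    'head': 1}]

Note that merging onto the first token means that when that token's head was inside the span (as with 'New' pointing to 'York' here), it ends up pointing at itself; keeping the span root's attributes and head instead (as the per-span merge effectively does) takes a bit more bookkeeping, which is basically the wrinkle raised in the next comment.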

Ok, I'm starting to make use of merging on large documents again, and my Cython is (a tiny bit) better, so I'm back to bother you again.

Your proposed implementation above looks ok except for one problem. What if there is another merge in between the current merge and the head? To fix that, for the case where the head is after the current index we would need to look ahead to the offset at the index of the head.

Also, we could save some space by not creating a new tokens array, but instead compressing the existing one and realloc'ing it, if that's possible (and not dangerous or likely to cause thread-safety issues for some reason).

Another possible memory optimization for very large documents (like books) would be to make the offsets array fixed size (around 100), clear it in between sentences, and keep an offset_so_far. This should be doable if head pointers don't cross sentence boundaries and we don't get sentences longer than 100 tokens (which I think are reasonable assumptions?).

So, to implement this change I need some info:

  1. Should this be implemented as a method of the Doc class?
  2. Where does the actual merging occur, i.e. where does the information from the merged tokens get put into the head of the span?
  3. Do I need to delete anything to make this work?
  4. What should the API of bulk_merge be? Where is it getting the tokens variable from?
  5. Are there already tests in place that will validate what I implement, or do I need to write new tests? If some other internal method is going to call this, it would be great to see what you think a functional test of the calling method should look like (maybe something along the lines of the sketch below?).
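
For instance, I'm imagining a functional test roughly like this (hypothetical: it assumes a bulk doc.merge() that takes a list of (start, end) token offsets, which isn't the current API, plus an EN pytest fixture with a loaded English pipeline that actually tags both city names as entities):

def test_bulk_merge_entities(EN):
    doc = EN(u'Los Angeles and New York are large cities')
    spans = [(ent.start, ent.end) for ent in doc.ents]
    # Assumed bulk signature: merge all spans in one pass.
    doc.merge(spans)
    texts = [t.text for t in doc]
    assert u'Los Angeles' in texts
    assert u'New York' in texts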

Something like this, based on your implementation above:

def bulk_merge(self, spans, tokens):
    # 1. Track offset between before and after index
    offsets = [0] * len(self)
    cdef int offset = 0
    cdef int i
    cdef int last_end = 0

    for start, end in spans:
        # Catch up the offsets to current position
        for i in range(last_end, start):
            offsets[i] = offset
        offsets[start] = offset

        for i in range(start+1, end):
            offset += 1
            # 2. Use the same array to mark deletions
            offsets[i] = -1
        last_end = start

    # 3. Put all the tokens where they should be, and adjust the heads and IDs
    new_tokens = [None] * len(self)
    for i, token in enumerate(tokens):
        if offsets[i] != -1:
            # Adjust the dependency head
            if token.head > token.i:
                token.head -= offsets[token.head.i]
            else:
                token.head -= offsets[i]
            # Adjust token index
            token.i -= offsets[i]
            # Put the token into its new spot
            new_tokens[i - offsets[i]] = token
    return new_tokens

I can't promise I'll actually make progress on this, but I'll try.

-- Eric

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
