Spacy: writing/reading multiple docs to/from same file in binary format

Created on 22 Mar 2019 · 8 comments · Source: explosion/spaCy

Feature description

For convenience and compactness, it would be great to be able to write and read multiple Doc objects to/from the same file. Back in the days of spaCy < 2.0, this worked! Such writes still work in v2.0+ —

import io
import spacy
import srsly

nlp = spacy.load("en")
docs = [
    nlp("This is the first document."),
    nlp("This is another document.")
]

fname = "/Users/burtondewilde/Desktop/test-spacy-docs-io.bin"
with io.open(fname, mode="wb") as f:
    for doc in docs:
        f.write(doc.to_bytes())

— but reads do not:

with io.open(fname, mode="rb") as f:
    for line in f:
        new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(line)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-5dffc40037a7> in <module>
      1 with io.open(fname, mode="rb") as f:
      2     for line in f:
----> 3         new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(line)

doc.pyx in spacy.tokens.doc.Doc.from_bytes()

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/spacy/util.py in from_bytes(bytes_data, setters, exclude)
    585 
    586 def from_bytes(bytes_data, setters, exclude):
--> 587     msg = srsly.msgpack_loads(bytes_data)
    588     for key, setter in setters.items():
    589         # Split to support file names like meta.json

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/srsly/_msgpack_api.py in msgpack_loads(data, use_list)
     27     # msgpack-python docs suggest disabling gc before unpacking large messages
     28     gc.disable()
---> 29     msg = msgpack.loads(data, raw=False, use_list=use_list)
     30     gc.enable()
     31     return msg

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/srsly/msgpack/__init__.py in unpackb(packed, **kwargs)
     58         object_hook = kwargs.get('object_hook')
     59         kwargs['object_hook'] = functools.partial(_decode_numpy, chain=object_hook)
---> 60     return _unpackb(packed, **kwargs)
     61 
     62 

_unpacker.pyx in srsly.msgpack._unpacker.unpackb()

ValueError: Unpack failed: incomplete input
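The unpacker fails because the concatenated msgpack messages have no record boundaries. One generic workaround is to frame each doc.to_bytes() payload with a length prefix so the reader knows where each record ends. A minimal sketch, using plain bytes as stand-ins for doc.to_bytes() output (write_records and read_records are hypothetical helpers, not spaCy API):

```python
import io
import struct

def write_records(f, payloads):
    # Prefix each payload with a 4-byte big-endian length so the
    # reader knows where one record ends and the next begins.
    for payload in payloads:
        f.write(struct.pack(">I", len(payload)))
        f.write(payload)

def read_records(f):
    # Read the 4-byte length header, then exactly that many bytes.
    while True:
        header = f.read(4)
        if not header:
            break
        (length,) = struct.unpack(">I", header)
        yield f.read(length)

buf = io.BytesIO()
write_records(buf, [b"doc-one-bytes", b"doc-two-bytes"])
buf.seek(0)
print(list(read_records(buf)))  # → [b'doc-one-bytes', b'doc-two-bytes']
```

Each yielded chunk would then be exactly one message to hand to Doc.from_bytes().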

It _is_ possible to read the data one doc at a time using msgpack directly, but each message comes in as a dict instead of bytes, which doesn't play nicely with Doc.from_bytes():

with io.open(fname, mode="rb") as f:
    unpacker = srsly.msgpack.Unpacker(f, raw=False, unicode_errors="strict")
    for msg in unpacker:
        new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(msg)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-f74bc6213f51> in <module>
      3     for msg in unpacker:
      4         print(msg)
----> 5         new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(msg)

doc.pyx in spacy.tokens.doc.Doc.from_bytes()

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/spacy/util.py in from_bytes(bytes_data, setters, exclude)
    585 
    586 def from_bytes(bytes_data, setters, exclude):
--> 587     msg = srsly.msgpack_loads(bytes_data)
    588     for key, setter in setters.items():
    589         # Split to support file names like meta.json

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/srsly/_msgpack_api.py in msgpack_loads(data, use_list)
     27     # msgpack-python docs suggest disabling gc before unpacking large messages
     28     gc.disable()
---> 29     msg = msgpack.loads(data, raw=False, use_list=use_list)
     30     gc.enable()
     31     return msg

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/srsly/msgpack/__init__.py in unpackb(packed, **kwargs)
     58         object_hook = kwargs.get('object_hook')
     59         kwargs['object_hook'] = functools.partial(_decode_numpy, chain=object_hook)
---> 60     return _unpackb(packed, **kwargs)
     61 
     62 

_unpacker.pyx in srsly.msgpack._unpacker.unpackb()

_unpacker.pyx in srsly.msgpack._unpacker.get_data_from_buffer()

TypeError: a bytes-like object is required, not 'dict'

By porting over some functionality from spacy.tokens.doc.Doc.from_bytes(), textacy has hacked together a workaround, but it's brittle and not great.

Is there a better way to do this? And if so, should it be included in spacy itself?

enhancement feat / serialize

Most helpful comment

We just released spaCy v2.2 with the new DocBin class for efficient binary serialization of Doc objects!

Details: https://spacy.io/usage/saving-loading#docs
API: https://spacy.io/api/docbin

All 8 comments

Hey,

It's not documented yet because it's still pending name change and refinement, but check out https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py

Hey Matt, thanks for the pointer — I saw this a while back then completely forgot about it. 😅 This seems like a pretty reasonable way to go about it, but my only big concern is that you have to read the full dataset into memory before streaming out the reconstructed docs. It's probably not a blocker for most use cases — I'll check that when I have a chance — but a way to hook into spaCy's Doc.from_bytes() code using msgpack.Unpacker could still be great...

That's a good point. Maybe we could also have a header file that provides the byte-offsets at which documents start, along with their lengths?

A belated update: I've implemented a version of your Binder solution in a development branch of textacy for de/serializing a collection of spacy docs: https://github.com/chartbeat-labs/textacy/blob/feature/update-corpus-and-doc/textacy/corpus.py#L349-L439. It seems to work pretty well, although there's the unresolved question of how best to handle user_data attached to the docs themselves.

@bdewilde Awesome!

For the user_data, perhaps you could support custom serializer functions for that. Some suggestions for built-in options:

  • Ignore user_data (i.e. a null serializer).
  • Serialize to msgpack if possible, raise an error if not.
  • If a value is msgpack-serializable, use msgpack; otherwise use pickle.
  • Just use pickle.
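The third option could be sketched as a tagged fallback encoder — try srsly's msgpack if it's available and the value packs cleanly, otherwise fall back to pickle. The names dumps_user_data / loads_user_data and the one-byte tag scheme are hypothetical, not spaCy or textacy API:

```python
import pickle

try:
    import srsly  # optional: preferred compact msgpack encoding
except ImportError:
    srsly = None

def dumps_user_data(user_data):
    # Tag the payload with its codec: b"m" for msgpack, b"p" for pickle.
    if srsly is not None:
        try:
            return b"m" + srsly.msgpack_dumps(user_data)
        except Exception:
            pass  # value not msgpack-serializable; fall through to pickle
    return b"p" + pickle.dumps(user_data)

def loads_user_data(data):
    tag, payload = data[:1], data[1:]
    if tag == b"m":
        return srsly.msgpack_loads(payload)
    return pickle.loads(payload)

print(loads_user_data(dumps_user_data({"score": 0.5})))  # → {'score': 0.5}
```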

Perhaps this'd be a good candidate for upstreaming into the main library, once you're happy with it?

Hey! Here's how it ended up. I used msgpack to dump the user_data without special handling, in hopes that users aren't putting anything exotic into their Doc metadata. I get the sense that there's probably a better way, but this is good enough for now. Definitely better than what I was doing before!

And yes, it would be amazing to have an official spaCy implementation of this sort of thing. I'm currently working on better integrating textacy _into_ spaCy rather than working around it, so the more functionality I can borrow and/or build upon from spaCy, the better. :)

We just released spaCy v2.2 with the new DocBin class for efficient binary serialization of Doc objects!

Details: https://spacy.io/usage/saving-loading#docs
API: https://spacy.io/api/docbin
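For reference, a minimal DocBin round trip might look like the following (a blank pipeline is used here so the sketch runs without a downloaded model; a loaded model works the same way):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [
    nlp("This is the first document."),
    nlp("This is another document."),
]

# Pack all docs into one binary blob (store_user_data keeps Doc.user_data).
doc_bin = DocBin(store_user_data=True)
for doc in docs:
    doc_bin.add(doc)
data = doc_bin.to_bytes()  # write this to a single file

# Read them back; get_docs() reconstructs Docs against a shared vocab.
new_docs = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
print([d.text for d in new_docs])
```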

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
