For convenience and compactness, it would be great to write and read multiple Doc objects to/from the same file. Back in the days of spacy<2.0, this worked! Nowadays, such writes still work in v2.0+ —
```python
import io

import spacy
import srsly

nlp = spacy.load("en")
docs = [
    nlp("This is the first document."),
    nlp("This is another document."),
]

fname = "/Users/burtondewilde/Desktop/test-spacy-docs-io.bin"
with io.open(fname, mode="wb") as f:
    for doc in docs:
        f.write(doc.to_bytes())
```
— but reads do not:
```python
with io.open(fname, mode="rb") as f:
    for line in f:
        new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(line)
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-5dffc40037a7> in <module>
      1 with io.open(fname, mode="rb") as f:
      2     for line in f:
----> 3         new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(line)

doc.pyx in spacy.tokens.doc.Doc.from_bytes()

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/spacy/util.py in from_bytes(bytes_data, setters, exclude)
    585
    586 def from_bytes(bytes_data, setters, exclude):
--> 587     msg = srsly.msgpack_loads(bytes_data)
    588     for key, setter in setters.items():
    589         # Split to support file names like meta.json

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/srsly/_msgpack_api.py in msgpack_loads(data, use_list)
     27     # msgpack-python docs suggest disabling gc before unpacking large messages
     28     gc.disable()
---> 29     msg = msgpack.loads(data, raw=False, use_list=use_list)
     30     gc.enable()
     31     return msg

_unpacker.pyx in srsly.msgpack._unpacker.unpackb()

ValueError: Unpack failed: incomplete input
```
It _is_ possible to read the data one doc at a time using msgpack directly, but each message comes in as a dict instead of bytes, which doesn't play nicely with `Doc.from_bytes()`:
```python
with io.open(fname, mode="rb") as f:
    unpacker = srsly.msgpack.Unpacker(f, raw=False, unicode_errors="strict")
    for msg in unpacker:
        new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(msg)
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-f74bc6213f51> in <module>
      3     for msg in unpacker:
      4         print(msg)
----> 5         new_doc = spacy.tokens.Doc(nlp.vocab).from_bytes(msg)

doc.pyx in spacy.tokens.doc.Doc.from_bytes()

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/spacy/util.py in from_bytes(bytes_data, setters, exclude)
    585
    586 def from_bytes(bytes_data, setters, exclude):
--> 587     msg = srsly.msgpack_loads(bytes_data)
    588     for key, setter in setters.items():
    589         # Split to support file names like meta.json

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/srsly/_msgpack_api.py in msgpack_loads(data, use_list)
     27     # msgpack-python docs suggest disabling gc before unpacking large messages
     28     gc.disable()
---> 29     msg = msgpack.loads(data, raw=False, use_list=use_list)
     30     gc.enable()
     31     return msg

~/.pyenv/versions/3.7.0/envs/textacy-py3/lib/python3.7/site-packages/srsly/msgpack/__init__.py in unpackb(packed, **kwargs)
     58     object_hook = kwargs.get('object_hook')
     59     kwargs['object_hook'] = functools.partial(_decode_numpy, chain=object_hook)
---> 60     return _unpackb(packed, **kwargs)
     61
     62

_unpacker.pyx in srsly.msgpack._unpacker.unpackb()
_unpacker.pyx in srsly.msgpack._unpacker.get_data_from_buffer()

TypeError: a bytes-like object is required, not 'dict'
```
By porting over some functionality from `spacy.tokens.doc.Doc.from_bytes()`, textacy has hacked together a workaround, but it's brittle and not great.
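For comparison, one generic way to make such files streamable is to length-prefix each record, so a reader can pull one serialized doc at a time without guessing where messages end. A minimal sketch (helper names are my own, not spaCy API):

```python
import struct

def write_records(f, records):
    # Prefix each record with its byte length as a 4-byte unsigned int,
    # so records can be read back one at a time.
    for rec in records:
        f.write(struct.pack("<I", len(rec)))
        f.write(rec)

def read_records(f):
    # Yield one record at a time; stops cleanly at end-of-file.
    while True:
        header = f.read(4)
        if not header:
            break
        (length,) = struct.unpack("<I", header)
        yield f.read(length)
```

In principle you could write each `doc.to_bytes()` through `write_records` and feed each chunk from `read_records` into `Doc(nlp.vocab).from_bytes(...)`, without ever holding the whole file in memory.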
Is there a better way to do this? And if so, should it be included in spacy itself?
Hey,
It's not documented yet because it's still pending name change and refinement, but check out https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py
Hey Matt, thanks for the pointer — I saw this a while back then completely forgot about it. 😅 This seems like a pretty reasonable way to go about it, but my one big concern is that you have to read the full dataset into memory before streaming out the reconstructed docs. It's probably not a blocker for most use cases — I'll check that when I have a chance — but a way to hook into spacy's Doc.from_bytes() code using msgpack.Unpacker could still be great...
That's a good point. Maybe we could also have a header file that provides the byte-offsets at which documents start, along with their lengths?
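Sketching that idea (illustrative only, not spaCy API): a separate index of fixed-size (offset, length) entries would allow random access into the data file without reading it all:

```python
import struct

ENTRY = "<QI"  # 8-byte offset, 4-byte length per record

def write_with_index(data_path, index_path, records):
    # Write records back-to-back, recording (offset, length) pairs
    # in a separate index file.
    index = []
    with open(data_path, "wb") as f:
        for rec in records:
            index.append((f.tell(), len(rec)))
            f.write(rec)
    with open(index_path, "wb") as f:
        for offset, length in index:
            f.write(struct.pack(ENTRY, offset, length))

def read_record(data_path, index_path, i):
    # Random access to the i-th record via the index.
    entry_size = struct.calcsize(ENTRY)
    with open(index_path, "rb") as f:
        f.seek(i * entry_size)
        offset, length = struct.unpack(ENTRY, f.read(entry_size))
    with open(data_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```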
A belated update: I've implemented a version of your Binder solution in a development branch of textacy for de/serializing a collection of spacy docs: https://github.com/chartbeat-labs/textacy/blob/feature/update-corpus-and-doc/textacy/corpus.py#L349-L439. It seems to work pretty well, although there's the unresolved question of how best to handle user_data attached to the docs themselves.
@bdewilde Awesome!
For the user_data, perhaps you could support custom serializer functions for that. Some suggestions for built-in options:

- don't serialize `user_data` at all (i.e. a null serializer)
- if it's msgpack-serializable, use msgpack; otherwise, fall back to pickle

Perhaps this'd be a good candidate for upstreaming into the main library, once you're happy with it?
Hey! Here's how it ended up. I used msgpack to dump the user_data without special handling, in hopes that users aren't putting anything exotic into their Doc metadata. I get the sense that there's probably a better way, but this is good enough for now. Definitely better than what I was doing before!
And yes, it would be amazing to have an official spaCy implementation of this sort of thing. I'm currently working on better integrating textacy _into_ spaCy rather than working around it, so the more functionality I can borrow and/or build upon from spaCy, the better. :)
We just released spaCy v2.2 with the new DocBin class for efficient binary serialization of Doc objects!
Details: https://spacy.io/usage/saving-loading#docs
API: https://spacy.io/api/docbin
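A minimal usage sketch, following the linked docs (the `spacy.blank("en")` pipeline here is just for illustration):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline is enough to create Docs

# Pack multiple docs into a single serializable container.
doc_bin = DocBin(store_user_data=True)
for text in ["This is the first document.", "This is another document."]:
    doc_bin.add(nlp(text))

data = doc_bin.to_bytes()  # one bytes blob for the whole collection

# Deserialize: reconstruct the docs against a shared vocab.
docs = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```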