Something along the lines of what fairseq has:
Here
For example, in MMapIndexedDataset, they calculate the sizes for each field and then read exactly that size into an array.
Here, @matt-gardner mentions that this is the type of caching that you guys might implement. A database that just stores tensors directly.
There is some initial work along these lines here (used in a dataset reader here). I think we definitely want to figure out how to include this in our main Instance creation pipeline, but that'll come a little bit later. If you're interested in helping to design / work on this, contributions welcome. This is a big enough piece of work that some kind of design document (or detailed github description) should come before any code is written.
@epwalsh, do we have an issue open already for this? If not, I'll keep this open and add it to either the 2.0 or the performance milestone (maybe 2.0?).
@matt-gardner I don't think we had a separate issue tracking caching until now, so adding this to 2.0 sounds good to me.
Should I create a .md page or a Google Docs?
Anyway, most of the logic will use the to_tensor method of each field, right?
For fields that don't have to_tensor, like MetadataField, use jsonpickle, and write that into the file.
Regarding the way we serialize the tensors, we can either:
I am leaning towards (2) since we are reading the instances in memory sequentially, which might be better than a key:value solution like lmdb.
Hey @OhadRubin, I've already started a design document which I'll link to on this thread once it's ready to be seen.
There is a good discussion here about lmdb: https://github.com/allenai/allennlp/pull/4578#issuecomment-680776842.
I'm working on an API design that would be agnostic to the backend or ser/deserialization method we use, so we can decide on that later.
So it seems there are different use cases for caching, maybe I can write some code that extends TensorCache into a Registrable and implement a MMapIndexedDataset style TensorCache.
@epwalsh, what do you think?
@OhadRubin that would be great if you got started on that!
Right now just I'm focusing on the overall API, and how that would integrate into our data pipeline, so I don't think that collides with what you want to work on. Once I'm finished with the API skeleton we should be able to plug in the TensorCache / MMapIndexedDataset / whatever we go with.
So I should inherit from DatasetReader and override _instances_to_cache_file, correct?
I understand there are plans to move caching to DataLoader, but for now, that should be ok, right?
@OhadRubin actually no, we're working off of the vision branch now, which will become AllenNLP 2.0. In that branch the caching mechanism in the DatasetReader has been removed. Maybe just start by implementing this in a standalone file?
Ok, so I'll assume I am iterating over objects with a serialize method, is that ok?
Yes, but we can assume these objects are indexed Instances.
Hey, @epwalsh, after looking a bit more into MMapIndexedDataset, much of the benefit of it comes from knowing the structure and sizes of everything, can I assume serialize returns a Dict[str, Union[numpy.array, str]]?
Hey @OhadRubin, well after the talking with the team a little more yesterday we decided it would probably be more beneficial to cache tensor dicts instead of actual Instance objects. In other words, we want to be caching the result of Instance.as_tensor_dict().
So in most cases, this is a Dict[str, torch.Tensor], which would play nicely with the MMapIndexedDataset approach. However, there are some exceptions. In particular, when an Instance contains a MetaDataField, Instance.as_dict_dict() could contain pretty much anything.
There is a overhead of around 760 bytes for saving a Tensor with torch.save, compared to numpy.tobytes.
So i'll use numpy.
import torch
import io
import pickle
import numpy as np
for dtype in [torch.bool, torch.float16,torch.float32,torch.int16,torch.int32,torch.int64]:
print(dtype)
for size in range(2,5):
in_list = list(range(10**size))
torch_tensor = torch.tensor(in_list,dtype=dtype).detach()
buffer = io.BytesIO()
torch.save(torch_tensor, buffer, pickle_protocol=pickle.HIGHEST_PROTOCOL)
torch_res = buffer.getbuffer().tobytes()
np_res = torch_tensor.numpy().tobytes(order='C')
print(f"size: {10**size} - torch_res: {len(torch_res)}, np_res: {len(np_res)}")
Output:
torch.bool
size: 100 - torch_res: 824, np_res: 100
size: 1000 - torch_res: 1720, np_res: 1000
size: 10000 - torch_res: 10744, np_res: 10000
torch.float16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.float32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.int32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int64
size: 100 - torch_res: 1528, np_res: 800
size: 1000 - torch_res: 8760, np_res: 8000
size: 10000 - torch_res: 80760, np_res: 80000
Edit:
@dirkgr, for images this is not negligible correct? The image features are 2000 x float32, so it is a 10% improvement in space.
And since torch.from_numpy and Tensor.numpy() are zero-copy operation, you get 10% less space usage for free.
API design document (work in progress): https://docs.google.com/document/d/1EBAcPF19NM7bYuwDKmHN4Ws361p0Eo4vjIpSA67l1rQ/edit?usp=sharing
numpy.tobytes() does not remember the data type, size, or value order of the tensor. There might be other stuff. torch.save() does, and we need that stuff to be preserved.
I would consider using numpy.tobytes() and torch.from_numpy() anyways because it allows us to create a torch tensor that's backed by memory-mapped bytes, no reading required. But we have some other hurdles to clear before we can do that. So for now, let's stick with torch.save().