AllenNLP: Better caching

Created on 24 Aug 2020 · 16 Comments · Source: allenai/allennlp

Something along the lines of what fairseq has:
Here
For example, in MMapIndexedDataset, they calculate the sizes for each field and then read exactly that size into an array.
Here, @matt-gardner mentions that this is the type of caching you might implement: a database that just stores tensors directly.
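A minimal sketch of the idea described above, not fairseq's actual implementation: variable-length arrays are concatenated into one flat binary file, their sizes are kept in a small index, and a reader uses `numpy.memmap` to pull out exactly the slice that belongs to one item. The names `write_flat_cache` and `read_item`, and the single shared dtype, are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def write_flat_cache(arrays, data_path, dtype=np.int64):
    """Write 1-D arrays back to back and return the per-item sizes (the index)."""
    sizes = []
    with open(data_path, "wb") as f:
        for arr in arrays:
            arr = np.ascontiguousarray(arr, dtype=dtype)
            f.write(arr.tobytes(order="C"))
            sizes.append(arr.size)
    return np.asarray(sizes, dtype=np.int64)


def read_item(data_path, sizes, i, dtype=np.int64):
    """Memory-map the file and return a copy of exactly the i-th item."""
    # Element offsets: item i occupies [offsets[i], offsets[i + 1]).
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    mm = np.memmap(data_path, dtype=dtype, mode="r")
    item = np.array(mm[offsets[i]:offsets[i + 1]])  # copy so the map can be released
    del mm
    return item


# Usage: three hypothetical "instances" of different lengths.
items = [np.arange(3), np.arange(10, 15), np.array([42])]
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "cache.bin")
sizes = write_flat_cache(items, path)
second = read_item(path, sizes, 1)
```

Because the index already holds every size, a reader never has to parse the data file to find item boundaries; it only computes an offset and reads that many elements.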

Feature request Under Development

OhadRubin

👍2

All 16 comments

There is some initial work along these lines here (used in a dataset reader here). I think we definitely want to figure out how to include this in our main Instance creation pipeline, but that'll come a little bit later. If you're interested in helping to design / work on this, contributions are welcome. This is a big enough piece of work that some kind of design document (or detailed GitHub description) should come before any code is written.

@epwalsh, do we have an issue open already for this? If not, I'll keep this open and add it to either the 2.0 or the performance milestone (maybe 2.0?).

matt-gardner on 24 Aug 2020

@matt-gardner I don't think we had a separate issue tracking caching until now, so adding this to 2.0 sounds good to me.

epwalsh on 24 Aug 2020

Should I create a .md page or a Google Doc?
Anyway, most of the logic will use the to_tensor method of each field, right?
For fields that don't have to_tensor, like MetadataField, use jsonpickle, and write that into the file.
Regarding how we serialize the tensors, we can either:

1. use TensorCache, which uses lmdb, or
2. use something similar to MMapIndexedDataset, where each tensor is converted into a numpy array and written to / read from a file sequentially.

I am leaning towards (2), since we read the instances into memory sequentially, and a sequential file might serve that better than a key:value store like lmdb.
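Option (2) could look roughly like the sketch below: each array is written as a small JSON header (dtype and shape) followed by its raw bytes, and the reader walks the stream front to back. The record layout and the names `write_arrays` and `iter_arrays` are hypothetical; this is not the format used by MMapIndexedDataset or TensorCache.

```python
import io
import json
import struct

import numpy as np


def write_arrays(stream, arrays):
    """Append each array as: 4-byte header length, JSON header, raw bytes."""
    for arr in arrays:
        arr = np.ascontiguousarray(arr)
        header = json.dumps({"dtype": arr.dtype.str, "shape": arr.shape}).encode("utf-8")
        stream.write(struct.pack("<I", len(header)))
        stream.write(header)
        stream.write(arr.tobytes(order="C"))


def iter_arrays(stream):
    """Yield arrays in the order they were written."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of stream
        (header_len,) = struct.unpack("<I", prefix)
        header = json.loads(stream.read(header_len).decode("utf-8"))
        dtype = np.dtype(header["dtype"])
        shape = tuple(header["shape"])
        # The header tells us exactly how many bytes belong to this record.
        n_bytes = dtype.itemsize * int(np.prod(shape, dtype=np.int64))
        data = stream.read(n_bytes)
        yield np.frombuffer(data, dtype=dtype).reshape(shape)


# Usage with an in-memory stream; a real cache would open a file instead.
buffer = io.BytesIO()
originals = [
    np.arange(6, dtype=np.float32).reshape(2, 3),
    np.array([1, 2, 3], dtype=np.int16),
]
write_arrays(buffer, originals)
buffer.seek(0)
restored = list(iter_arrays(buffer))
```

A layout like this favours full passes over the data in write order. Random access by key, which lmdb provides, would need an extra offset index on top.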

OhadRubin on 25 Aug 2020

Hey @OhadRubin, I've already started a design document which I'll link to on this thread once it's ready to be seen.

epwalsh on 25 Aug 2020

👍1

There is a good discussion here about lmdb: https://github.com/allenai/allennlp/pull/4578#issuecomment-680776842.

I'm working on an API design that would be agnostic to the backend or ser/deserialization method we use, so we can decide on that later.

epwalsh on 26 Aug 2020

So it seems there are different use cases for caching. Maybe I can write some code that turns TensorCache into a Registrable and implement an MMapIndexedDataset-style TensorCache.
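The proposal above amounts to putting several cache backends behind one registered interface. The sketch below shows that pattern with a tiny stand-in registry written in plain Python. It does not use AllenNLP's Registrable, and `CacheBackend`, `InMemoryBackend`, `put`, and `get` are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, Type


class CacheBackend:
    """Hypothetical base class with a small name -> subclass registry."""

    _registry: Dict[str, Type["CacheBackend"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type["CacheBackend"]], Type["CacheBackend"]]:
        def decorator(subclass: Type["CacheBackend"]) -> Type["CacheBackend"]:
            if name in cls._registry:
                raise ValueError(f"backend {name!r} is already registered")
            cls._registry[name] = subclass
            return subclass

        return decorator

    @classmethod
    def by_name(cls, name: str) -> Type["CacheBackend"]:
        return cls._registry[name]

    def put(self, key: str, value: Any) -> None:
        raise NotImplementedError

    def get(self, key: str) -> Any:
        raise NotImplementedError


@CacheBackend.register("in_memory")
class InMemoryBackend(CacheBackend):
    """Placeholder backend; an lmdb or mmap backend would register the same way."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value

    def get(self, key: str) -> Any:
        return self._store[key]


# Usage: the backend is picked by name, for example from a config file.
backend = CacheBackend.by_name("in_memory")()
backend.put("instance-0", [1, 2, 3])
```

With this shape, the choice between a key:value store and a sequential file becomes a configuration value rather than a code change.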
@epwalsh, what do you think?

OhadRubin on 26 Aug 2020

@OhadRubin that would be great if you got started on that!

Right now I'm just focusing on the overall API and how it would integrate into our data pipeline, so I don't think that collides with what you want to work on. Once I'm finished with the API skeleton we should be able to plug in the TensorCache / MMapIndexedDataset / whatever we go with.

epwalsh on 26 Aug 2020

So I should inherit from DatasetReader and override _instances_to_cache_file, correct?
I understand there are plans to move caching to DataLoader, but for now, that should be ok, right?

OhadRubin on 26 Aug 2020

@OhadRubin actually no, we're working off of the vision branch now, which will become AllenNLP 2.0. In that branch the caching mechanism in the DatasetReader has been removed. Maybe just start by implementing this in a standalone file?

epwalsh on 26 Aug 2020

Ok, so I'll assume I am iterating over objects with a serialize method, is that ok?

OhadRubin on 26 Aug 2020

Yes, ~~but we can assume these objects are indexed Instances.~~

epwalsh on 26 Aug 2020

Hey, @epwalsh, after looking a bit more into MMapIndexedDataset, I see that much of its benefit comes from knowing the structure and sizes of everything. Can I assume serialize returns a Dict[str, Union[numpy.ndarray, str]]?

OhadRubin on 27 Aug 2020

Hey @OhadRubin, after talking with the team a little more yesterday, we decided it would probably be more beneficial to cache tensor dicts instead of actual Instance objects. In other words, we want to be caching the result of Instance.as_tensor_dict().

So in most cases, this is a Dict[str, torch.Tensor], which would play nicely with the MMapIndexedDataset approach. However, there are some exceptions. In particular, when an Instance contains a MetadataField, Instance.as_tensor_dict() could contain pretty much anything.
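One way to handle the exception described above is to split each tensor dict into entries that can take the compact array path and entries that need a generic serializer. The sketch below uses NumPy arrays as stand-ins for torch tensors and `pickle` as the fallback (the thread mentions jsonpickle for this role). `split_tensor_dict` and `merge_tensor_dict` are hypothetical helpers, not part of AllenNLP.

```python
import pickle
from typing import Any, Dict, Tuple

import numpy as np


def split_tensor_dict(tensor_dict: Dict[str, Any]) -> Tuple[Dict[str, np.ndarray], bytes]:
    """Separate array-valued entries from everything else."""
    arrays: Dict[str, np.ndarray] = {}
    other: Dict[str, Any] = {}
    for key, value in tensor_dict.items():
        if isinstance(value, np.ndarray):
            arrays[key] = value
        else:
            other[key] = value  # e.g. metadata, which can hold anything
    return arrays, pickle.dumps(other, protocol=pickle.HIGHEST_PROTOCOL)


def merge_tensor_dict(arrays: Dict[str, np.ndarray], other_blob: bytes) -> Dict[str, Any]:
    """Rebuild the original mapping from the two parts."""
    merged: Dict[str, Any] = dict(arrays)
    merged.update(pickle.loads(other_blob))
    return merged


# Usage with a made-up example resembling a tensor dict.
example = {
    "tokens": np.array([101, 2023, 102]),
    "label": np.array(1),
    "metadata": {"id": "ex-7", "text": "hello"},
}
arrays, blob = split_tensor_dict(example)
rebuilt = merge_tensor_dict(arrays, blob)
```

Only the array part would then go through the size-indexed or memory-mapped storage; the pickled blob can be stored alongside it as an opaque record.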

epwalsh on 27 Aug 2020

👍1

There is an overhead of roughly 720 to 760 bytes for saving a Tensor with torch.save, compared to numpy.tobytes.
So I'll use numpy.

import io
import pickle

import torch

# Compare the serialized size of torch.save against raw NumPy bytes
# for several dtypes and tensor lengths (100, 1000 and 10000 elements).
dtypes = [torch.bool, torch.float16, torch.float32, torch.int16, torch.int32, torch.int64]
for dtype in dtypes:
    print(dtype)
    for size in range(2, 5):
        in_list = list(range(10**size))
        torch_tensor = torch.tensor(in_list, dtype=dtype)
        # torch.save writes the element data plus metadata into the buffer.
        buffer = io.BytesIO()
        torch.save(torch_tensor, buffer, pickle_protocol=pickle.HIGHEST_PROTOCOL)
        torch_res = buffer.getvalue()
        # tobytes writes only the raw element data, with no header.
        np_res = torch_tensor.numpy().tobytes(order="C")
        print(f"size: {10**size} - torch_res: {len(torch_res)}, np_res: {len(np_res)}")

Output:

torch.bool
size: 100 - torch_res: 824, np_res: 100
size: 1000 - torch_res: 1720, np_res: 1000
size: 10000 - torch_res: 10744, np_res: 10000
torch.float16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.float32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.int32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int64
size: 100 - torch_res: 1528, np_res: 800
size: 1000 - torch_res: 8760, np_res: 8000
size: 10000 - torch_res: 80760, np_res: 80000

Edit:
@dirkgr, for images this is not negligible, correct? The image features are 2000 x float32, so it is roughly a 10% improvement in space.
And since torch.from_numpy and Tensor.numpy() are zero-copy operations, you get roughly 10% less space usage for free.
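The percentage can be checked against the printed output. The sketch below copies the measurements from the output above, computes the per-tensor overhead of torch.save, and relates it to a 2000 x float32 feature vector. No 2000-element tensor was measured, so the last step assumes its overhead falls in the same range as the measured ones.

```python
# (dtype, number of elements) -> (torch.save bytes, numpy.tobytes bytes),
# copied from the output printed above.
measurements = {
    ("bool", 100): (824, 100),
    ("bool", 1000): (1720, 1000),
    ("bool", 10000): (10744, 10000),
    ("float16", 100): (952, 200),
    ("float16", 1000): (2744, 2000),
    ("float16", 10000): (20728, 20000),
    ("float32", 100): (1144, 400),
    ("float32", 1000): (4728, 4000),
    ("float32", 10000): (40760, 40000),
    ("int16", 100): (952, 200),
    ("int16", 1000): (2744, 2000),
    ("int16", 10000): (20728, 20000),
    ("int32", 100): (1144, 400),
    ("int32", 1000): (4728, 4000),
    ("int32", 10000): (40760, 40000),
    ("int64", 100): (1528, 800),
    ("int64", 1000): (8760, 8000),
    ("int64", 10000): (80760, 80000),
}

# Fixed cost that torch.save adds on top of the raw element data.
overheads = [torch_bytes - np_bytes for torch_bytes, np_bytes in measurements.values()]
low, high = min(overheads), max(overheads)

# Assumed case: 2000 float32 values occupy 2000 * 4 raw bytes.
payload = 2000 * 4
saving_low = low / (payload + low)     # share of the torch.save size saved
saving_high = high / (payload + high)

print(f"overhead: {low}-{high} bytes")
print(f"saving for 2000 x float32: {saving_low:.1%} to {saving_high:.1%}")
```

The overhead is a near-constant cost per saved tensor, so its relative weight shrinks as tensors grow and dominates for small ones.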

OhadRubin on 27 Aug 2020

👀1

API design document (work in progress): https://docs.google.com/document/d/1EBAcPF19NM7bYuwDKmHN4Ws361p0Eo4vjIpSA67l1rQ/edit?usp=sharing

epwalsh on 27 Aug 2020

numpy.tobytes() does not remember the data type, shape, or value order of the tensor. There might be other things it drops as well. torch.save() does, and we need that information to be preserved.

I would consider using numpy.tobytes() and torch.from_numpy() anyways because it allows us to create a torch tensor that's backed by memory-mapped bytes, no reading required. But we have some other hurdles to clear before we can do that. So for now, let's stick with torch.save().
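The trade-off in this comment can be illustrated with NumPy alone. The sketch below shows that the raw bytes carry no dtype or shape, so both have to be stored separately, and that `numpy.frombuffer` then rebuilds the array as a view over the existing bytes without copying. The variable names are illustrative, and the final `torch.from_numpy` step is left out so the example runs without PyTorch.

```python
import numpy as np

original = np.arange(6, dtype=np.float32).reshape(2, 3)
raw = original.tobytes(order="C")  # raw element data only, no header of any kind

# Without the dtype, the same bytes are silently misread as other values.
misread = np.frombuffer(raw, dtype=np.int32)

# With dtype and shape kept on the side, the round trip is exact.
meta = {"dtype": original.dtype.str, "shape": original.shape}
restored = np.frombuffer(raw, dtype=np.dtype(meta["dtype"])).reshape(meta["shape"])

# frombuffer returns a view over the existing bytes rather than a copy;
# the view is read-only here because a bytes object is immutable.
is_view = not restored.flags.owndata
is_read_only = not restored.flags.writeable
```

The same view-over-bytes behaviour is what would let an array sit on top of a memory-mapped file, provided the dtype and shape are recorded somewhere outside the raw data.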