Zstd: Random access

Created on 8 Oct 2016 · 12 Comments · Source: facebook/zstd

Hello!
Has someone already started something like this: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files/ ?
I'm looking for a way to extract partial content of compressed files, and it seems that zstd frames and user (skippable) frames could be used to add an index and decompress only parts of a file.

Use case: We have several text files, originally 20 MB each, compressed to 4 MB, and want to retrieve random pieces of around 20 KB to 200 KB decompressed.

Ideally something like:
ZSTDLIB_API size_t ZSTD_decompress_piece(
void* dst, size_t dstCapacity,
const void* src, size_t compressedSize,
size_t uncompressed_start_pos,
size_t uncompressed_size);

Cheers!

Labels: duplicate, enhancement

All 12 comments

See issue #395. You can break your data up into small chunks and compress each chunk independently. Then you concatenate them all and, at the beginning, put a table (stored inside a skippable frame) that points to where each frame starts.

We are considering adding a table like this to the standard by adopting the skippable frame that pzstd uses.
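To make this concrete, here is a minimal sketch of that layout in C. The table encoding (a plain array of 4-byte little-endian compressed-frame sizes), CHUNK_SIZE, and compress_seekable() are invented for illustration and are not the pzstd format; only the skippable-frame header (a magic in 0x184D2A50..0x184D2A5F plus a 4-byte little-endian content size) comes from the zstd frame format:

#include <stdint.h>
#include <stdlib.h>
#include <zstd.h>

#define CHUNK_SIZE (128 * 1024)        /* uncompressed bytes per frame (arbitrary) */
#define SKIPPABLE_MAGIC 0x184D2A50U    /* first of the 16 skippable-frame magics */

static void write_le32(uint8_t* p, uint32_t v)
{
    p[0] = (uint8_t)v; p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

/* Returns a malloc'd buffer laid out as:
 * [skippable frame holding the table][frame 0][frame 1]... */
uint8_t* compress_seekable(const void* src, size_t srcSize, size_t* outSize)
{
    size_t nbChunks = (srcSize + CHUNK_SIZE - 1) / CHUNK_SIZE;
    size_t tableSize = 8 + 4 * nbChunks;   /* 8-byte header + one u32 per frame */
    size_t cap = tableSize + nbChunks * ZSTD_compressBound(CHUNK_SIZE);
    uint8_t* dst = malloc(cap);
    if (dst == NULL) return NULL;

    write_le32(dst, SKIPPABLE_MAGIC);
    write_le32(dst + 4, (uint32_t)(4 * nbChunks));  /* skippable content size */

    size_t pos = tableSize;
    for (size_t i = 0; i < nbChunks; i++) {
        size_t chunk = srcSize - i * CHUNK_SIZE;
        if (chunk > CHUNK_SIZE) chunk = CHUNK_SIZE;
        size_t cSize = ZSTD_compress(dst + pos, cap - pos,
                                     (const uint8_t*)src + i * CHUNK_SIZE, chunk, 3);
        if (ZSTD_isError(cSize)) { free(dst); return NULL; }
        write_le32(dst + 8 + 4 * i, (uint32_t)cSize);  /* record frame size */
        pos += cSize;
    }
    *outSize = pos;
    return dst;
}

To serve a read at uncompressed offset off, the frame index is off / CHUNK_SIZE, its start is the sum of the first off / CHUNK_SIZE table entries, and a single ZSTD_decompress() of that one frame reaches the data.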

You can also split a file into small chunks, train a dictionary on the chunks, and compress/decompress them independently. You get random-ish access while keeping a relatively good compression ratio.
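A minimal sketch of that approach using the public ZDICT/ZSTD dictionary APIs (the chunk layout and compression level are arbitrary choices for the example):

#include <zstd.h>
#include <zdict.h>

/* Train a shared dictionary from the chunks; samplesBuffer holds all chunks
 * concatenated, samplesSizes[i] is the size of chunk i. Returns the
 * dictionary size, or an error code (check with ZDICT_isError). */
size_t make_dict(void* dictBuffer, size_t dictCapacity,
                 const void* samplesBuffer, const size_t* samplesSizes,
                 unsigned nbSamples)
{
    return ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                 samplesBuffer, samplesSizes, nbSamples);
}

/* Compress one chunk against the shared dictionary; since each chunk is an
 * independent frame, any single chunk can later be decompressed on its own
 * with ZSTD_decompress_usingDict() and the same dictionary. */
size_t compress_chunk(ZSTD_CCtx* cctx, void* dst, size_t dstCapacity,
                      const void* chunk, size_t chunkSize,
                      const void* dict, size_t dictSize)
{
    return ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                   chunk, chunkSize, dict, dictSize, 3);
}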

Then you concatenate them all and, at the beginning, put a table (stored inside a skippable frame) that points to where each frame starts.

@terrelln Can it be put at the end of the zstd file? For Hadoop-like (append-only) filesystems, it would be much easier and more efficient to put metadata at the end, as we can then write once from beginning to end.

In terms of the format, it can be put anywhere. However, it's only useful if you can use it when you are decompressing, so it has to be somewhere the decompressor can find it. Perhaps it can be put in a separate file? Putting it at the end of the file is the easiest for writing, but is there a way to find out where it is and read it? Any input here is very valuable.

If I were writing an API to read/write these tables, I might give the user a choice of where the table goes. So pass out separate buffers for the table and the data during compression, and require the user to give us both the table and the data for decompression.
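Purely as a strawman, such a two-buffer interface might look like this (all of these names are hypothetical; nothing like them exists in zstd):

/* Hypothetical sketch only -- not a real zstd API. */
typedef struct {
    void*  tableDst;  size_t tableCapacity;   /* receives the seek table        */
    void*  dataDst;   size_t dataCapacity;    /* receives the compressed frames */
} SeekableBuffers;

size_t seekable_compress(SeekableBuffers* out,
                         const void* src, size_t srcSize, int level);

/* The caller stores the two pieces wherever it likes (head of the file, end
 * of the file, a separate file) and hands both back for random access. */
size_t seekable_decompress(void* dst, size_t dstCapacity,
                           const void* table, size_t tableSize,
                           const void* data, size_t dataSize,
                           size_t rangeStart, size_t rangeSize);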

Perhaps it can be put in a separate file?

Yes, but that creates small files, as metadata is usually small.

but is there a way to find out where it is and read it?

Yes. Put the metadata footer size at the very end of the footer (something like footerSize + magicNumber + contentHash); we can then determine the footer position. You can refer to the 4mc spec. Of course, it's just one idea.
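As an illustration of that idea, a reader could locate the footer with something like the following; the 12-byte trailer layout (footerSize, then magicNumber, then contentHash) and the magic value are invented for this sketch, not taken from the actual 4mc spec:

#include <stdio.h>
#include <stdint.h>

#define FOOTER_MAGIC 0x46544D43U   /* arbitrary marker for this sketch */

static uint32_t read_le32(const uint8_t* p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* On success, positions f at the start of the footer and returns its size;
 * returns -1 on failure. */
long find_footer(FILE* f)
{
    uint8_t trailer[12];
    if (fseek(f, -12L, SEEK_END) != 0) return -1;
    if (fread(trailer, 1, 12, f) != 12) return -1;

    uint32_t footerSize = read_le32(trailer);      /* written last, read first */
    uint32_t magic      = read_le32(trailer + 4);
    /* trailer[8..11] would hold the content hash; not checked in this sketch */

    if (magic != FOOTER_MAGIC) return -1;
    if (fseek(f, -(long)footerSize, SEEK_END) != 0) return -1;
    return (long)footerSize;
}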

Looks like we are off topic; let's continue the conversation when you and Cyan4973 update the standard.

Sounds good. I haven't read the 4mc spec yet, I'll make sure to check it out this week.

For what it's worth, C-Blosc has support for exactly this scenario via its blosc_getitem() API. Basically, what Blosc does is split the chunk to be compressed into smaller blocks (from 8 KB up to 2 MB, depending on the compression level and codec used), plus an index to them; with this, blosc_getitem() will only decompress those blocks that contain the data in [start, start+nitems). This, combined with some internal caches in the data container using Blosc, leads to important speed-ups when reading small excerpts of data.
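A minimal, self-contained sketch of that flow (assuming C-Blosc >= 1.10, which added the Zstd codec; sizes and levels are arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <blosc.h>

int main(void)
{
    const size_t nitems = 1000000;
    const size_t nbytes = nitems * sizeof(int32_t);
    int32_t* src = malloc(nbytes);
    for (size_t i = 0; i < nitems; i++) src[i] = (int32_t)i;

    blosc_init();
    blosc_set_compressor("zstd");            /* use Zstd as the codec */

    void* packed = malloc(nbytes + BLOSC_MAX_OVERHEAD);
    int csize = blosc_compress(5, BLOSC_SHUFFLE, sizeof(int32_t), nbytes,
                               src, packed, nbytes + BLOSC_MAX_OVERHEAD);
    if (csize <= 0) return 1;                /* error or not compressible */

    /* Decompress only items [5000, 5200): Blosc touches just the internal
     * blocks covering that range, not the whole chunk. */
    int32_t out[200];
    if (blosc_getitem(packed, 5000, 200, out) < 0) return 1;
    printf("first item: %d\n", (int)out[0]); /* prints 5000 */

    blosc_destroy();
    free(src); free(packed);
    return 0;
}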

Blosc is a meta-compressor which supports Zstd as one of its codecs. It is used in a series of data containers that store data in compressed chunks (e.g. bcolz, bloscpack, zarr, HDF5, and HDF5 for Julia).

Hello!
I've been looking at FrancescAlted's suggestion of C-Blosc but could not find a way to do it transparently. I mean an API where we can start compressing data and keep adding more until we signal the end of compression, as we do with zlib.
Something like (pseudo code):

//Compressing
int initialize_compression_context(ctx, more parameters here);
int start_compression_context(ctx);
save_compression_context_header(ctx);
while(data_to_compress)
{
     src = read_create_next_chunk_of_data();
     compress_one_block(ctx, src, src_len, dest, dest_len);
     save_compressed_block(dest, dest_len);
}
end_compression_context(ctx);
save_compression_context_footer(ctx);

//Uncompressing all
int initialize_compression_context(ctx, more parameters here);
load_compression_context_header(ctx, src_header);
while(data_to_uncompress)
{
     src = read_next_chunk_of_compressed_data();
     uncompress_block(ctx, src, src_len, dest, dest_len);
     save_uncompressed_block(dest, dest_len);
}

//Random-access uncompressing
uncompressed_offset_wanted = 12890;
uncompressed_size_wanted = 20000;

int initialize_compression_context(ctx, more parameters here);
load_compression_context_header(ctx, src_header);
compressed_offset = get_compressed_offset(ctx, uncompressed_offset_wanted);
compress_seek(src_storage, compressed_offset);

block_offset_adjusted = 0;
total_data_saved = 0;
need_block_offset_adjust = true;

while(data_to_uncompress)
{
     block_compressed = read_next_chunk_of_compressed_data(src_storage);
     uncompress_block(ctx, block_compressed, block_compressed_len, dest, dest_len);
     if(need_block_offset_adjust)
     {
        uncompressed_start_offset_of_this_block = get_uncompressed_offset(ctx, block_compressed);
        block_offset_adjusted = uncompressed_offset_wanted - uncompressed_start_offset_of_this_block;
        need_block_offset_adjust = false;
     }
     else block_offset_adjusted = 0;
     data_size_to_save = dest_len - block_offset_adjusted;
     if( (total_data_saved + data_size_to_save) > uncompressed_size_wanted)
          data_size_to_save = uncompressed_size_wanted - total_data_saved;
     save_uncompressed_block(dest + block_offset_adjusted, data_size_to_save);
     total_data_saved += data_size_to_save;
     if(total_data_saved == uncompressed_size_wanted) break;
}

With Blosc you don't need to pass data in small blocks; just pass the whole chunk and Blosc will split and index the data for you. You can see a blosc_getitem() example here. Also, you may want to have a look at the simple workflow it takes to create compressed data chunks with C-Blosc.

Hello FrancescAlted!
For some cases that's fine, but what if we have more data than available RAM?
I think both in-memory and file-based chunks should be available.
Cheers!

Fair enough. As said, C-Blosc is mainly meant to be used in libraries (like the ones mentioned before) that provide this additional layer of (chunked) persistence. On the other hand, C-Blosc2 is meant to address this by providing the concept of super-chunks, where chunks can be added to and retrieved from the super-chunk at any time. However, Blosc2 is mostly a work in progress; in particular, it does not implement the file abstraction yet (although this should not be too far away).

This is now discussed in #395
