Zstd: Reusing context for compression

Created on 5 May 2020 · 22 comments · Source: facebook/zstd

This is more of a query than an issue. I am trying to use ZSTD_compressCCtx() to be more memory efficient. I am allocating the context and initializing it as a static context on a per-CPU basis. At the time of write IO, I am using ZSTD_compressCCtx() and passing the per-CPU zstd context. After some time, I am seeing that the API returns a memory_allocation error. I am not sure why this is happening.

In the fuzzer unit test, I noticed that the API is preceded by ZSTD_compressBegin(). After using this, I didn't see any error. But why is this required even for non-streaming compression? I hope we don't need to initialize the ctx before every compression.

Static context allocation:

void xxx_allocate_zstd_mem(ZSTD_CCtx **zstd_comp_wrkmem,
                            ZSTD_DCtx **zstd_decomp_wrkmem)
{
        size_t wrkmem_size = 0;
        void *wrkmem = NULL;

        /* compression context: budget depends on the compression level */
        wrkmem_size = ZSTD_estimateCCtxSize(xxx_zstd_cmpr_level);
        wrkmem = xxx_mem_alloc(wrkmem_size);
        *zstd_comp_wrkmem = ZSTD_initStaticCCtx(wrkmem, wrkmem_size);

        /* decompression context: budget is a fixed size */
        wrkmem_size = ZSTD_estimateDCtxSize();
        wrkmem = xxx_mem_alloc(wrkmem_size);
        *zstd_decomp_wrkmem = ZSTD_initStaticDCtx(wrkmem, wrkmem_size);
}

zstd compression using context:

xxx_zstd_compress(<>)
{
        size_t out_bound = 0;
        size_t c_len = 0;
        ZSTD_CCtx *zstd_wrkmem = xxx_pcpu_mem.zstd_comp_wrkmem;

        out_bound = ZSTD_compressBound(len_in);
        c_len = ZSTD_compressBegin(zstd_wrkmem, xxx_zstd_cmpr_level);
        if (ZSTD_isError(c_len)) {
                return Z_ERRNO;
        }
        c_len = ZSTD_compressCCtx(zstd_wrkmem,
                                  out, out_bound,
                                  in, len_in,
                                  xxx_zstd_cmpr_level);
        if (ZSTD_isError(c_len)) {
                return Z_ERRNO;
        }
        return Z_OK;
}

Thanks!


All 22 comments

1) ZSTD_compressBegin() is only required for streaming buffer-less mode. There's no need for it when compressing in one-shot mode, which is what ZSTD_compressCCtx() is doing.

2) ZSTD_estimateCCtxSize() provides the memory budget for one-shot mode. Note that there is also ZSTD_estimateCStreamSize(), which provides the memory budget for streaming and buffered mode. Both values are different.

3) By invoking ZSTD_compressBegin(), you are requesting a memory budget suitable for any input size, which is higher than the memory budget reserved for one-shot mode on a known input size. This can result in memory allocation adjustments.

From zstd.h documentation :

 *  ZSTD_estimateCCtxSize() will provide a budget large enough for any
 *  compression level up to selected one. Unlike ZSTD_estimateCStreamSize*(),
 *  this estimate does not include space for a window buffer, so this estimate
 *  is guaranteed to be enough for single-shot compressions, but not streaming
 *  compressions.

Note that advanced functions like ZSTD_initStaticCCtx() exist for platforms with tight memory constraints. As far as performance is concerned, this method is unlikely to fare much differently from the more usual and recommended ZSTD_createCCtx().
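
For illustration, a minimal sketch that prints both budgets side by side (assuming the advanced API is exposed via ZSTD_STATIC_LINKING_ONLY; level 3 is an arbitrary example, not a value from this report):

#define ZSTD_STATIC_LINKING_ONLY   /* the estimate functions are advanced API */
#include <zstd.h>
#include <stdio.h>

int main(void)
{
    int const level = 3;   /* assumed example level */
    printf("one-shot CCtx budget    : %zu bytes\n", ZSTD_estimateCCtxSize(level));
    printf("streaming CStream budget: %zu bytes\n", ZSTD_estimateCStreamSize(level));
    return 0;
}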

Thanks for the quick response. I encountered memory allocation error without using ZSTD_compressBegin(). Only with usage of it before invocation of ZSTD_compressCCtx(), I got memory allocation error. Note this is for one-shot compression and not streaming mode, with input buffer size not exceeding 32KB.

Also, the reason for using static context allocation is that we don't want ZSTD_compressCCtx() to dynamically allocate memory. It refrains from doing so only when the ctx is statically allocated.

Is it possible to pass ZSTD_createCCtx() a custom memalloc function and designate the allocation as static using this API?

Is it possible to pass ZSTD_createCCtx() a custom memalloc function

Yes. One can use ZSTD_createCCtx_advanced(). The allocation policy is then entirely controlled by the custom allocator.
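
As a minimal sketch of that approach (capped_alloc, capped_free and the budget policy are hypothetical illustrations, not part of zstd's API):

#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_createCCtx_advanced() is advanced API */
#include <zstd.h>
#include <stdlib.h>

/* Hypothetical allocator that refuses any request above a fixed budget;
 * returning NULL makes zstd fail with a memory_allocation error code
 * instead of growing its workspace. */
static void* capped_alloc(void* opaque, size_t size)
{
    size_t const budget = *(size_t const*)opaque;
    return (size <= budget) ? malloc(size) : NULL;
}

static void capped_free(void* opaque, void* address)
{
    (void)opaque;
    free(address);
}

/* The budget pointed to must outlive the returned cctx. */
ZSTD_CCtx* create_capped_cctx(size_t const* budget)
{
    ZSTD_customMem const cmem = { capped_alloc, capped_free, (void*)budget };
    return ZSTD_createCCtx_advanced(cmem);
}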

I encountered memory allocation error without using ZSTD_compressBegin().

This is weird. If I understand your use case correctly, it should have worked. I would like to study a reproduction case then.

Under what circumstances will memory be allocated dynamically in ZSTD_compressCCtx()? Even if I use dynamic memory for the context, is it possible to ask this API not to allocate more memory when it needs it?

Under what circumstances will memory be allocated dynamically in ZSTD_compressCCtx()?

ZSTD_compressCCtx() will request a new allocation if what is already allocated is not large enough for the next compression task to complete.

is it possible to ask this API not to allocate more memory when it needs it?

If an implementation uses a custom allocator, the custom allocator can enforce any rule it wishes. It can limit the amount of memory available, for example.
Obviously, if the context doesn't receive the memory it needs, the compression will fail, and return a corresponding error code (ZSTD_error_memory_allocation).

To expand on @Cyan4973's answer, there are two kinds of ZSTD_CCtxes:

  1. "Dynamic", aka regular cctxes, which are allowed to perform memory allocations (via the ZSTD_customMem functions if provided or otherwise by calling malloc and free).
  2. "Static", created by the user providing a fixed memory buffer to ZSTD_initStaticCCtx(), which will never perform any memory allocations.

When starting a compression, zstd figures out how much memory it will need for its internal data structures and buffers. It can then use the buffer it already has allocated if it's large enough (but not too large). Otherwise, it will free the old workspace and allocate a new one (unless it's a static cctx, in which case it just fails the operation).

Having acquired an appropriately sized workspace, zstd then internally allocates and initializes its data structures. When reusing a workspace, this is extremely cheap.
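
As a rough sketch of that reuse pattern (buffer sizes and level are illustrative assumptions):

#include <zstd.h>
#include <stdio.h>
#include <string.h>

/* One long-lived dynamic cctx serves many one-shot compressions; once the
 * workspace is sized by the first call, later calls with same-sized inputs
 * reuse it without fresh allocations. */
int main(void)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    static char src[32 * 1024];
    static char dst[ZSTD_COMPRESSBOUND(32 * 1024)];
    for (int i = 0; i < 100; i++) {
        memset(src, 'A' + (i % 26), sizeof(src));   /* vary the content a little */
        size_t const cSize = ZSTD_compressCCtx(cctx, dst, sizeof(dst),
                                               src, sizeof(src), 3);
        if (ZSTD_isError(cSize)) {
            printf("round %d failed: %s\n", i, ZSTD_getErrorName(cSize));
            break;
        }
    }
    ZSTD_freeCCtx(cctx);
    return 0;
}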

I'm curious if you can describe more about the environment you're using Zstd in where you're concerned about internal allocations. Is it an embedded system? But then you're describing multiple cores.

Thanks for the detailed explanation. This is for storage systems. For performance reasons, we don't want to dynamically allocate memory during compression.

From your explanation, I see that even if we estimate and allocate static memory, it is not guaranteed that the memory will be just sufficient (neither too small nor too large) for zstd compression to happen without erroring out.

Nope, that's not the correct conclusion.

Your scenario (one-shot compression, input buffer size not exceeding 32KB) is fairly straightforward, and is expected to be supported. Static allocation is already in use in kernel environments for similar needs, and seems to work fine so far. This ability is also part of our CI test suite, where canonical scenarios are checked before each merge.
The documentation is supposed to clearly specify which guarantees it can offer, and we have not so far received reports that it doesn't work as intended.

The difficulty here is understanding and reproducing the issue you are facing (memory allocation errors).
The one detail I noticed in your sample code is that ZSTD_compressBegin() isn't required.

But then, I also do not understand your following answer :

I encountered memory allocation error without using ZSTD_compressBegin(). Only with usage of it before invocation of ZSTD_compressCCtx(), I got memory allocation error.

So, there were memory allocation errors without ZSTD_compressBegin(), and then you added ZSTD_compressBegin(), but it still generates memory allocation errors ?

Sorry for the confusion. There was a typo.

I encountered memory allocation error without using ZSTD_compressBegin(). Only with usage of it before invocation of ZSTD_compressCCtx(), I did not encounter memory allocation error.

So my problem is ZSTD_compressCCtx returning memory allocation error with static memory for my scenario.

In which case, I don't understand what's going on.

As mentioned, this API is tested in CI.
I went through the tests, and reinforced them, so that among the scenarios tested, one looks exactly like yours. And it works as intended : the sequence ZSTD_estimateCCtxSize() -> ZSTD_initStaticCCtx() -> ZSTD_compressCCtx() runs fine, without generating a memory allocation error.
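
For reference, a condensed sketch of that sequence (level 3 and the 32 KB input mirror the scenario described here; they are assumptions, not the exact test code):

#define ZSTD_STATIC_LINKING_ONLY   /* estimate/initStatic are advanced API */
#include <zstd.h>
#include <stdlib.h>
#include <assert.h>

int main(void)
{
    int const level = 3;
    size_t const cctxSize = ZSTD_estimateCCtxSize(level);
    void* const wksp = malloc(cctxSize);
    ZSTD_CCtx* const cctx = ZSTD_initStaticCCtx(wksp, cctxSize);
    assert(cctx != NULL);

    static char src[32 * 1024];   /* zeroed, highly compressible input */
    static char dst[ZSTD_COMPRESSBOUND(32 * 1024)];
    size_t const cSize = ZSTD_compressCCtx(cctx, dst, sizeof(dst),
                                           src, sizeof(src), level);
    assert(!ZSTD_isError(cSize));

    free(wksp);
    return 0;
}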

So at this stage, I'm not sure what's going on.
Maybe there are other implementation details that matter and could lead to this outcome.

First usage of the static memory with ZSTD_compressCCtx succeeds. Upon repeated usage, in particular with data of varying compressibility, the API returns a memory error.

Is there any reinitialisation required for static context memory in such a scenario?

Ah, this is a different scenario then.

Invoking ZSTD_compressCCtx() several times with the same context should work. No need to re-initialise.

What does "upon repeated usage" mean ? Do you have a more precise metric ?

I guess we would need a reproducible scenario.
For reference, the test verifying ZSTD_compressCCtx() using a state allocated with ZSTD_initStaticCCtx() invokes it 3 times, and all 3 times succeed.

By repeated usage, I mean without reinitializing. This per-CPU static memory usage encountered an error after maybe the 100th invocation, compressing an input buffer of size 32K but with different compressibility at each successive invocation.

Thanks, this gives me enough of a hint to create a scenario which should reproduce the problem.
Once that's done, I should be able to put up a fix and provide a mitigation for existing versions.
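
For the record, a hypothetical reproduction sketch along those lines (all parameters are guesses modeled on the description above; on affected versions, a loop of this kind could eventually return a memory_allocation error):

#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int const level = 3;   /* assumed level */
    size_t const cctxSize = ZSTD_estimateCCtxSize(level);
    void* const wksp = malloc(cctxSize);
    ZSTD_CCtx* const cctx = ZSTD_initStaticCCtx(wksp, cctxSize);

    static char src[32 * 1024];
    static char dst[ZSTD_COMPRESSBOUND(32 * 1024)];
    srand(0);
    for (int round = 0; round < 1000; round++) {
        /* vary compressibility: mostly zeros in early rounds, noise later */
        for (size_t i = 0; i < sizeof(src); i++)
            src[i] = (rand() % 1000 < round) ? (char)rand() : 0;
        size_t const cSize = ZSTD_compressCCtx(cctx, dst, sizeof(dst),
                                               src, sizeof(src), level);
        if (ZSTD_isError(cSize)) {
            printf("round %d: %s\n", round, ZSTD_getErrorName(cSize));
            break;
        }
    }
    free(wksp);
    return 0;
}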

Does this issue exist with dynamic memory initialized with ZSTD_createCCtx()?
How often would ZSTD_compressCCtx() allocate more memory, if dynamic memory is passed, for the scenario that I have (data with varying compressibility)?
Also by when will the fix be available?

Does this issue exist with dynamic memory initialized with ZSTD_createCCtx()?

Nope, since it can resize at will.

How often would ZSTD_compressCCtx() allocate more memory, if dynamic memory is passed, for the scenario that I have (data with varying compressibility)?

This entirely depends on the scenario. From what I understand of it, it probably needs adjusting just a few times, and it would then stabilize.

by when will the fix be available?

v1.4.5

Quick question :
were you using any kind of sanitizer during your tests ?

Issue reproduced and fixed.
The fix will be present in next release (v1.4.5).

A mitigation exists for the described use case on existing and previous versions of the library: ensure that the memory reserved for the working context is not too large. This avoids the risk of triggering an automatic downsize due to low memory usage (which was the problem), and also saves a lot of memory in the process.

The idea is that the CCtx can live with much less memory if one can guarantee that all inputs to compress will necessarily be <= 32 KB. In which case, the memory budget can be constrained, using for example this code :

size_t insizeMax = 32 * 1000;   /* guaranteed upper bound on input size */
ZSTD_compressionParameters cParams = ZSTD_getCParams(xxx_zstd_cmpr_level, insizeMax, 0);
size_t cctxSize = ZSTD_estimateCCtxSize_usingCParams(cParams);
void* buffer = malloc( cctxSize );
ZSTD_CCtx* cctx = ZSTD_initStaticCCtx(buffer, cctxSize);

In the above example, the memory budget reserved for the cctx is much reduced, on the condition that all inputs are always <= 32 KB. If this condition is not respected, ZSTD_compressCCtx() will fail.

Quick question :
were you using any kind of sanitizer during your tests ?

No.

Issue reproduced and fixed.
The fix will be present in next release (v1.4.5).

Thanks. When is the release of v1.4.5 planned?

A mitigation exists for the described use case on existing and previous versions of the library: ensure that the memory reserved for the working context is not too large. This avoids the risk of triggering an automatic downsize due to low memory usage (which was the problem), and also saves a lot of memory in the process.

This worked. But are there scenarios in which this can break, other than a change of input size to something greater than 32K?

The idea is that the CCtx can live with much less memory if one can guarantee that all inputs to compress will necessarily be <= 32 KB. In which case, the memory budget can be constrained, using for example this code :

I did not face any issue with decompression using a ctx so far. Does this apply to decompression as well? Should the decompression context be sized similarly?

If we choose to use the dynamic memory approach, is it possible to create a pool of per-CPU memory that can be used for context allocation for compression and decompression by a custom memory allocator?

What would be the constraints for this memory pool, such as the size of each memory chunk (based on the max and min memory requirements for compression and decompression) and the number of such chunks?

But are there scenarios in which this can break, other than a change of input size to something greater than 32K?

Not really. Using a higher compression level, or a different set of custom parameters, would break it, as a consequence of requiring more memory. But this is already well documented.

Does this apply to decompression as well?

No. For one-shot decompression, the DCtx memory budget is fixed, independent of any parameter or content length.
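
A minimal sketch, assuming the same advanced API (the function name is illustrative):

#define ZSTD_STATIC_LINKING_ONLY   /* estimate/initStatic are advanced API */
#include <zstd.h>
#include <stdlib.h>

/* The one-shot DCtx budget is a single fixed number: no level or input-size
 * parameter is involved, so the static decompression context needs no tuning. */
ZSTD_DCtx* make_static_dctx(void)
{
    size_t const dctxSize = ZSTD_estimateDCtxSize();
    void* const wksp = malloc(dctxSize);
    return ZSTD_initStaticDCtx(wksp, dctxSize);
}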

If we choose to use the dynamic memory approach, is it possible to create a pool of per-CPU memory that can be used for context allocation for compression and decompression by a custom memory allocator?

Applications which select a custom memory allocator can enforce any rule they wish,
including per-CPU memory for contexts.

In my humble opinion, this strategy is mostly useful when the amount of memory needed is not too large, and can fit into one CPU's cache.
Regarding zstd, this property holds for low compression levels, but as the level increases, the memory budget increases too, and it becomes less and less useful to define a "per cpu" affinity, since such an amount of memory must come from some larger "shared pool" anyway, like L3 cache or main memory.
