Dear ZSTD friends, hope you are all doing well! :)
Today I was looking into the dictBuilder (ZDICT) APIs to see how they fit this scenario, and I have a question to ask (not critical)...
Is your feature request related to a problem? Please describe.
Samples are distributed across multiple nodes. While I could gather them all in one place, that is not the best fit for our infrastructure.
So I'd like to check whether the following is possible and supported: on each node/host, I call ZDICT_trainFromBuffer to build a dict buffer from the partial samples available there; then a central server pulls all the dict buffers and either merges them into a single one, or retrains on the dict buffers, if that makes sense at all.
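For context, the per-node step I have in mind looks roughly like the sketch below. The helper name and error handling are just illustrative; only ZDICT_trainFromBuffer, ZDICT_isError and ZDICT_getErrorName are actual zdict.h calls:

```c
#include <stdio.h>
#include <zdict.h>   /* ZDICT_trainFromBuffer, ZDICT_isError, ZDICT_getErrorName */

/* Per-node step (illustrative helper): train a dictionary from the samples
 * stored locally. `samplesBuffer` holds all local samples concatenated
 * back-to-back, and samplesSizes[i] is the size of sample i. */
static size_t trainLocalDict(void* dictBuffer, size_t dictCapacity,
                             const void* samplesBuffer,
                             const size_t* samplesSizes, unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samplesBuffer, samplesSizes, nbSamples);
    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "dictionary training failed: %s\n", ZDICT_getErrorName(dictSize));
        return 0;
    }
    return dictSize;   /* actual dictionary size, <= dictCapacity */
}
```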
Describe the solution you'd like
size = train/merge(dictBuffers, nbBuffers, finalDictBuffer, maxBufferSize)
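In other words, something along the lines of the prototype below (purely hypothetical; no such function exists in zdict.h today):

```c
/* Hypothetical API, not part of zdict.h: combine (merge, or retrain from)
 * several per-node dictionaries into one final dictionary.
 * Would return the size written into finalDictBuffer, or an error code. */
size_t ZDICT_combineDictionaries(void* finalDictBuffer, size_t finalDictCapacity,
                                 const void* const* dictBuffers,
                                 const size_t* dictSizes, unsigned nbDicts);
```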
Describe alternatives you've considered
Pull all samples onto the server and run ZDICT_trainFromBuffer once on all of them. The downside is that there might be too many samples, so the system has to decide to drop some of them, which may or may not impact the result; it also may not be fast enough in terms of bytes to be transferred.
Additional context
n/a
Hi Shawn!
Hope you are doing great at Pinterest these days!
> make a dict buffer from partial samples, now on the server it pulls all dict buffers, and if it can merge all of them into a single one

Merging (concatenating) multiple dictionaries into one will not provide good value.
> or retrain dict buffers if that makes sense at all.

Training a new dictionary from a list of previously generated dictionaries will likely provide _some_ results, but it's difficult to say whether it would be meaningfully better.
Placed in a similar situation, my first idea would be to sample randomly (a rough per-node sketch follows the list):
- select a budget you are willing to spend to extract samples from each node (for example, 100 samples per node and per day),
- randomly select them from the list of samples stored locally each day,
- join all these samples in a single place,
- run the dictionary builder on these selected samples.
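A minimal sketch of the per-node random selection, assuming samples can be addressed by index (the helper name, the budget value and the RNG choice are illustrative, not part of zstd):

```c
#include <stdlib.h>   /* rand */

/* Per-node step (illustrative): pick `budget` sample indices uniformly at
 * random out of `nbSamples` local samples, using a partial Fisher-Yates
 * shuffle. The selected samples are then shipped to the central host,
 * where ZDICT_trainFromBuffer runs once on the whole selection.
 * `indices` must have room for nbSamples entries; on return,
 * indices[0..budget-1] hold the selected sample ids. */
static void pickRandomSamples(unsigned* indices, unsigned nbSamples, unsigned budget)
{
    if (budget > nbSamples) budget = nbSamples;
    for (unsigned i = 0; i < nbSamples; i++) indices[i] = i;
    for (unsigned i = 0; i < budget; i++) {
        unsigned const j = i + (unsigned)rand() % (nbSamples - i);
        unsigned const tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
    }
}
```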
The idea is that a dictionary should find common sequences present across many samples. So if they are common, they should be statistically present and still common in a randomly selected sub-section of the samples.
The resulting dictionary is likely to be of similar quality to the one produced by the previously suggested method (building dictionaries locally, then training a final dictionary from the local ones). I would even expect it to be slightly better on average, due to better preservation of context and correlation. Perhaps a more important benefit is that it skips a training stage per node, which can be CPU- and memory-taxing.
Note that the "random" in "randomly select" is very important; avoid just picking a slab where all samples share the same timeframe, for example.
Hope it helps
Hi Yann, thank you so much for the reply. BTW, I really miss you and the ZSTD team. :)