Zstd: Train on single file

Created on 19 Sep 2016 · 16 comments · Source: facebook/zstd

I'd like to be able to train a dictionary on a single file, for later re-use. The current CLI requires chopping the file into chunks before training. If required, I could pass in a string delimiter (effectively treating the single file as multiple files).
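For context, the chop-and-train workaround looks something like the following sketch (file names and the chunk size are illustrative):

mkdir chunks
split -b 4k records.bin chunks/sample    # one file per 4 KB chunk
zstd --train -r chunks/ -o records.dict  # train on every file under chunks/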

Most helpful comment

You can use the recursive mode -r to name just the directory where the 100K+ samples are. No bash expansion limitation in this case.

All 16 comments

One question is how to provide such a string delimiter.
The syntax itself can be a problem, being either too complex or too limited.
That's a good topic to make progress on.

My present use cases all involve an exact string. Fixed-size chunks could happen, but I've only seen them in one specialized situation; that said, they'd work as a side effect anyway. So a string delimiter of up to 256 bytes, or a chunk length of up to, say, 1 MB (both arbitrary numbers), would cover anything I've come up against.

In that one use case, an offset would also have been useful for training passes for different dictionaries. I don't want this request to try to accommodate too much, but if you do add an offset by byte count or by delimiter count, you'd cover both fixed-field and delimited record sets. Neither is in my use case (and I'm talking about unescaped delimiter sets, which is what most big data would be).

I'd still suggest arbitrarily limiting the chunk size / delimiter scan size to 1 MB, just to keep things sane.

Without a clear syntax to tell the program how to slice input into multiple parts, this possibility won't materialize. The syntax issue is the most important one preventing training on a single (large) file.

For my use cases, I simply want a delimiter, e.g. a newline or an XML end tag, so I can easily train up a dictionary for later use in per-record situations.

-Charles


Maybe specific cases like that are better handled by splitting the file yourself (csplit, for instance, may be what you're looking for) or by using the zstd library directly?
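For instance, a delimiter-based split with csplit might look like this sketch (GNU csplit assumed; the file name and record tag are illustrative):

mkdir samples
# Cut big.xml after every line containing </record>; '{*}' repeats the
# pattern as often as it matches, and -z drops empty output files.
csplit -z -f samples/rec big.xml '/<\/record>/+1' '{*}'
zstd --train -r samples/ -o records.dict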

Will end up using the library, but for a lot of quick utility work, it'd be nice to be able to work on a file or a pipe without creating a clutter of file entries.

That's why I'm thinking the simple string delimiter would cover things.

From this discussion, it may simply make more sense to put features like I'm requesting into a little tool instead of adding them to the zstd binary.


Finding a good syntax to define what a string delimiter is, one that is neither too limited nor too complex, is the main issue at stake.

It should preferably follow some already-established standard if possible, like split for example.
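For reference, split's established interface, which the thread builds on below:

split -l 1000 input.txt part_   # split by line count
split -b 64k  input.bin part_   # split by byte count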

To clarify: are you suggesting a regular expression?

All of my use cases are fixed strings; an XML end tag is a string, and a newline is a string. If train would take data from stdin, that could easily satisfy more cases: for example, I could use cut to select a column from a TSV file, or grep with a regular expression, printing out only the matches. In both cases my delimiter would be "\n".
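Something like the following sketch, using split as a stand-in since zstd --train doesn't read samples from stdin (file names and the column number are illustrative):

D="$(mktemp -d)"
cut -f3 data.tsv | split -l 1 - "$D/sample"   # one sample per line, read from stdin
zstd --train -r "$D" -o column3.dict
rm -r "$D"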

-Charles


Yes,
though we don't need to support all possible regular expressions to begin with.

We may just support the most basic delimiters at the beginning, such as newline separation.

The point is to make sure the syntax can be extended later, so that it may encompass regular expressions some day in the future.

It makes sense to support the POSIX standard for fixed strings, and to use a different flag for fixed strings than for regex (if that's brought in later), much as grep does.
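That is, the grep convention of separate flags for the two match types:

grep -F '</record>' log.xml   # -F: match a fixed string literally
grep -E '</[a-z]+>' log.xml   # -E: match an extended regular expression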

EDIT: The following question is easily answered in ./programs/zstdcli.c

Is there a training parameter that would produce approximately the same compressed size (with dictionary) as compressing the single file whole? This is about making sure there are enough levers in the CLI.
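For reference, two of the dictionary-related levers the CLI already exposes (sample paths are illustrative):

zstd --train -r samples/ -o my.dict --maxdict=16384   # cap the dictionary size at 16 KB
zstd -D my.dict -19 record.bin                        # compress one record using the dictionary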

I think a thin wrapper around split and zstd --train is the simplest solution: here is an example bash script that forwards all its arguments to split, allowing any supported pattern:

#!/bin/bash
set -e   # abort on the first failing command

D="$(mktemp -d)"        # temporary directory for the sample files
split "$@" "$D/line"    # forward all script arguments to split
zstd -r --train "$D"    # train on every sample under $D
rm -r "$D"              # clean up the temporary samples

Example usage: ./wrap_train.sh -l 1 input.txt will treat each line of input.txt as a separate training sample.

EDIT: Updated to remove bash expansion issue.
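Since the arguments go straight to split, the same wrapper can also produce fixed-size chunks (file name is illustrative):

./wrap_train.sh -b 4k big.bin   # one sample per 4 KB chunk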

+1 for training on a single file

I'm trying to train on 100,000 samples. I'm using the split method but running into a bash limitation on the number of input arguments. I think the limit is somewhere around 30,000, which, at one sample per file, means we can only train on about 30,000 samples at most. Is that a reasonable training set size? Any other suggestions?

You can use the recursive mode -r to name just the directory where the 100K+ samples are. No bash expansion limitation in this case.
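Concretely, with the samples gathered under a directory (name assumed):

zstd --train -r samples/ -o trained.dict   # zstd walks the directory itself; the shell never expands 100K file names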

@Cyan4973 thanks! Did not think to look for that option

In the latest dev update, it's now possible to cut an input into fixed-size blocks, using the command-line option -B#.
For instance, zstd --train -B4KB bigFile will internally cut bigFile into chunks of 4 KB, each one being considered a separate sample. This is equivalent to:
split -b 4k bigFile dir/ && zstd --train -r dir/

Obviously, it's less efficient than having real samples with real beginnings: if bigFile actually contains concatenated samples of roughly 4 KB, each sample's true beginning will be lost in the process, resulting in incorrect statistics.
Still, in general, it provides a good hint of the kind of gains to expect from dictionary compression, so it can be a good evaluation tool.
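A quick way to run that evaluation (file names are illustrative):

zstd --train -B4KB bigFile -o block.dict
zstd -D block.dict record.bin -o with_dict.zst   # compress one record with the dictionary
zstd record.bin -o without_dict.zst              # and without, for comparison
ls -l with_dict.zst without_dict.zst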
