Integrating ZStd into other "non-standard" build systems is slightly harder than it should be, primarily because it requires setting up custom #include paths for the source files.
E.g. zbuff_compress.c does an #include "error_private.h" which is a file in a different folder.
Changing all the includes to be relative to the source file location would make this problem go away, e.g. if the same include was #include "../common/error_private.h" and so on.
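For illustration, a custom build currently has to declare the internal directories itself. A hypothetical Makefile fragment (directory names taken from the zstd repository layout; the exact set depends on which features are compiled) might look like:

```make
# Include paths a custom build must add so that zstd's plain
# #include "error_private.h" style resolves across subdirectories.
ZSTDDIR  = zstd/lib
CPPFLAGS += -I$(ZSTDDIR) -I$(ZSTDDIR)/common \
            -I$(ZSTDDIR)/compress -I$(ZSTDDIR)/decompress
```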
Actually, the plain #include "file.h" form was a deliberate design choice: we don't know the folder structure of the user's project, so files could be regrouped differently.
For example, sometimes, all files are regrouped in a single directory.
By "delegating" path setup to the build system, it becomes possible to re-arrange files in any way the user wants.
But indeed, it requires the user to setup paths, which is a task.
It seems difficult to find an ideal solution here that would satisfy both situations.
Making the #include relative to current file would skip the need to setup paths, but would also oblige all user projects to preserve the directory structure of zstd repository.
Not sure which one is better ...
Yeah that makes sense. Perhaps point that out in projects/readme.md?
If you're on the fence which way is better, then I don't think I ever saw a 3rd party library integration where files were rearranged (but I might have very limited data). Keeping file structure makes it easier to upgrade to later versions, since it's just "copy over" instead of flattening the files again.
Re-arranging the files is an investment of future time multiplied for every platform.
We edit in the relative paths and dedupe the repeated functions and then:
```cpp
namespace zstd1 {
extern "C" {
#include "1/zstd-0.8.1/lib/common/entropy_common.c"
#include "1/zstd-0.8.1/lib/common/xxhash.c"
#undef FORCE_INLINE
#include "1/zstd-0.8.1/lib/common/zstd_common.c"
//
#include "1/zstd-0.8.1/lib/compress/fse_compress.c"
#include "1/zstd-0.8.1/lib/compress/huf_compress.c"
// #include "1/zstd-0.8.1/lib/compress/zbuff_compress.c"
#include "1/zstd-0.8.1/lib/compress/zstd_compress.c"
//
#undef FORCE_INLINE
#include "1/zstd-0.8.1/lib/common/fse_decompress.c"
#include "1/zstd-0.8.1/lib/decompress/huf_decompress.c"
// #include "1/zstd-0.8.1/lib/decompress/zbuff_decompress.c"
#include "1/zstd-0.8.1/lib/decompress/zstd_decompress.c"
// #include "1/zstd-0.8.1/common/zstd_static.h"
}
#undef ERROR
}
```
I want to note that I've hit this particular issue as well. In general being able to use a third party library without rearranging or changing headers is nice; I'd like to be able to just add zstd as a folder or as a git submodule and compile it without adding internal zstd folders to the include path. I can do this with many third-party libraries, including lz4 ;) but not with zstd.
If relative include paths are inconvenient one potential solution is to just put all zstd source files and internal headers in a single folder... (that is, to change zstd repository structure so that there's just one flat source folder)
Actually, I was going to say that I planned to drop this ticket, because experience showed us it's great to be able to move files around without the need to modify them (all #include statements remain the same).
Of course, it depends on having all relevant directories listed in the include path. That's something the provided Makefile does, as do the cmake script and the Visual Studio solution.
But indeed, it's always possible to have _another_ build environment, for which no script/config is provided. In which case, it's necessary to set these directories, which is an annoyance.
What I could propose then is another directory, lib-flat for example, where all relevant files would be present. This is similar to @zeux's suggestion, though it would be an additional directory.
The simplest way to do it would be to add target files as symlinks. Unfortunately, this does not work for Windows.
So it could be replaced by some script, something like lib-flat/createLibFlat.sh for Unix and lib-flat\createLibFlat.bat for Windows. They would simply copy (or symlink) the relevant files into the target directory.
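A sketch of what such a copy script could do, written in Python for brevity (the function name, the subdirectory list, and the repository layout are assumptions based on this discussion, not an actual zstd script):

```python
import shutil
from pathlib import Path

def create_lib_flat(repo_root, out_dir, subdirs=("common", "compress", "decompress")):
    """Copy .c/.h files from the chosen lib/ subdirectories into one flat directory,
    so plain #include "file.h" statements resolve without extra include paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for sub in subdirs:
        for f in sorted((Path(repo_root) / "lib" / sub).iterdir()):
            if f.suffix in (".c", ".h"):
                shutil.copy(str(f), str(out / f.name))
    # Return the flattened file list, e.g. for a build manifest.
    return sorted(p.name for p in out.iterdir())
```

Feature selection (legacy, dictBuilder, decoder-only, ...) would then just be a matter of passing a different `subdirs` tuple.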
Unfortunately, there is also the question of what "relevant files" means.
This is the reason why source files are separated into multiple directories to begin with.
I suspect decompression is a must, and compression is likely wanted too.
But it's less clear if the dictionary builder is wanted. And even less clear for legacy format support.
Should we add options to these scripts then?
As can be seen, what started as a trivial workaround now sounds like a mini project...
Better have the specs right before starting.
There's a way for this to not be a large effort or require a spec - just flatten the hierarchy inside lib/. zstd uses few files so I don't think I see the value in hierarchy inside that folder. You can use name prefixes if there's really a need to split files in multiple groups.
This solution is easy, pragmatic and maintains the option of moving files into different folders in case somebody wants to do that for whatever reason (is there really a use case for shuffling files from zstd/lib around?)
The primary reason for the sub-directory structure in /lib is to allow users to include or disable specific features. It's probably something a newcomer will not think about initially. But experience tells that when integrating the library, many users will choose to _not_ include lib/deprecated directory. Nor the lib/legacy one. And while at it, not even the lib/dictBuilder one.
More rarely, some would only need the decoder, or just the encoder.
Having a directory structure makes these choices relatively easy to apply. Moreover, code lives: new files get added, existing ones are split, old ones are deprecated. Such situations are also relatively simple to manage: compulsory files go into lib/common, compression-related ones go into lib/compress, etc. Existing build scripts automatically pick up such changes, and continue generating numerous variants with more or fewer features enabled.
I'm not against another solution, especially if it can help other use cases and build systems, but any alternative must preserve the current capability to selectively enable or drop library features.
Right. The way I see it there are several conflicting requirements here:

1. Selectively including or disabling library features (legacy support, dictionary builder, encoder or decoder only, ...).
2. Adding zstd to an arbitrary build system without configuring internal include paths.
3. Re-arranging files into a different folder structure.
4. Using the sources as-is, without an extra generation or copy step.
Current solution solves 1, 3, 4 at the expense of 2.
Changing includes to be relative solves 1, 2, 4 at the expense of 3.
Introducing symlinks or separate build scripts that create a different set of sources solves 1, 2, 3 at the expense of 4.
Switching to a flat file structure solves 2, 3, 4 at the expense of 1 (technically it makes 3 moot instead of solving, but whatever).
One other solution that hasn't really been discussed here is amalgamation (making a script that merges all internal headers/sources into one .c file, similar to sqlite and some other libraries). It sort of violates 4, although not as strongly as symlinks, and makes 1 impossible, but it solves 2 and 3 and makes zstd completely trivial to integrate: adding just one file to an arbitrary build system of choice. Libraries that are structured like this are fantastic to use; lz4 basically worked that way if you didn't use HC mode.
I guess from my perspective, 3 is way less valuable than 2. Not solving 2 breaks basic user expectations (I can't add zstd to my Makefiles/Visual Studio projects/etc. without adding internal zstd folders to my include settings, changing zstd files, or moving them around), while not solving 3 only leaves unsatisfied some niche use case that I don't understand. Obviously I'm not the project developer so I don't necessarily have the same view of these problems - all I'm saying is that 2 was (and is) a problem for me, and I've never encountered the need to do 3 for any libraries.
These are valid talking points @zeux .
Amalgamation is a topic I've been interested in lately.
As I see it, it's interesting purely for ease of integration, as we already know that compilers do not like such giant source files (performance suffers, as optimisation heuristics get drowned in the huge number of opportunities to analyse). But even if there is a little speed hit, it may still be worth it.
The way I imagine it could work is to produce it with a dedicated script.
We still need to work on clearly separated files during development.
We could, for example, create another directory, like lib-amalgamated, where the source code would be simplified to a zstd.h and a zstd.c. We could also make sure it's always updated before each release.
There are, however, two remaining issues:
cat *.c is not enough: files have to be edited (especially #include statements), order of inclusion is important, many symbols that were isolated now become duplicated, etc.

At the end of the day, I feel the issue is not so different from the previously described idea of providing a lib-flat directory, where all "relevant" files would be present directly at root. There is the same issue of selecting the "right" scope (which can default to the same compression+decompression scope); it can also be prepared in advance for each release, and the script could also be run by users, with the possibility to select different options.
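For what it's worth, the core of such an amalgamation script is small. A minimal sketch (the handling of internal includes is illustrative; a real script would also need the correct inclusion order and symbol deduplication described above):

```python
from pathlib import Path

def amalgamate(sources, out_path):
    """Concatenate source files in the given order into one translation unit.
    Quoted (internal) #include lines are commented out, since their contents
    are expected to be inlined already; system includes (<...>) are kept."""
    with open(out_path, "w") as out:
        for src in sources:
            out.write(f"/* ==== begin {src} ==== */\n")
            for line in Path(src).read_text().splitlines():
                if line.lstrip().startswith('#include "'):
                    out.write(f"/* {line.strip()} (inlined) */\n")
                else:
                    out.write(line + "\n")
```

This is exactly the part that `cat *.c` cannot do: deciding which #include lines to keep and which to neutralize.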
But the big advantage is that it only copies files, so it's much simpler than amalgamation.
As a bonus, it's a bit better for runtime performance, and compatible with partial and multi-threaded compilation.
The downside, compared to an amalgamated source, is that it requires the host project to set up one include directory. That's better than several, but, indeed, more than zero.
> As I see it, it's interesting purely for ease of integration, as we already know that compilers do not like such giant source files (performance suffers, as optimisation heuristics get drowned in the huge number of opportunities to analyse). But even if there is a little speed hit, it may still be worth it.
Has this been the case for zstd? My experience does not match that at all - there absolutely are issues with large functions with complex CFGs, but I haven't seen pessimisation from large sources where individual functions are reasonable - if anything, in some cases amalgamation improves performance due to better inlining (which you could also get by using whole program optimization, but that significantly slows down linking and isn't available on all platforms).
With respect to the scope of amalgamated library, what I was thinking of is just the default compression+decompression function set (that's likely the 90% use case). And yeah there definitely can be edits that are required for amalgamation to work (for zstd specifically I actually tried compiling it as an amalgamated source, the issues are pretty minor - CHECK_F macro is redefined in multiple source files, and one function, ZSTD_limitCopy, is defined twice).
@zeux zstd compression saw a small speed boost when breaking a large file up into multiple segments, see PR #831.
@terrelln I'm not quite sure I understand whether this is a speed boost or just noise; I see compression and decompression speeds going both up and down in the first gist linked in that issue.
@zeux some of it is just noise, I didn't present the data very well since I didn't write it with presentation in mind. The difference shows itself on my devserver with all the gcc compilers on level 1 and 3. I'm certain those aren't just noise, since I repeated the benchmarks many times with consistent results. The difference doesn't show up on entropy, where those levels are about the same speed before and after.
We expect the number of files to continue to grow for each category in the future.
For the time being, the current directory design has proven to work pretty well, and will continue to serve us well as the code base expands.
I understand that setting up #include directories in 3rd party build systems can be a pain.
A proposed solution is to provide scripts which would copy all relevant files into a dedicated "single-level" (flat) directory, so all files live side by side and the #include statements work out of the box.
We can certainly develop such a script if there is a need.
As this topic has not been active for quite some time now, I'll close it. We can certainly re-open it if needed.