I have a binary that requires a large number of data files generated from a single source file.
As far as I can tell, Bazel has no way to express this. The architecture of Bazel forces us to enumerate our outputs during the Loading phase, but we won't have the list of outputs until we parse the source file, and we can't perform any parsing actions during the Loading phase.
Here are some workarounds I've considered:
1) I could zip the runfiles up during the build and declare only that zip as a data file. At runtime, I could run a wrapper script to unzip them into a temp directory before running the real binary. But it's a lot of files; launching the binary will be unacceptably slow.
2) The build is hermetic, so the output files are technically predictable; I could run the generator once, enumerate the files, and declare them all explicitly as outputs in my BUILD file. But then the BUILD file would just duplicate the data from the source file; I'd really prefer not to do that.
3) I tried generating the data files in a directory and declaring just the directory as an output, but I got a scary warning that "dependency checking of directories is unsound."
The currently recommended best practice is 1); this is how Java does things. You may be able to use tar to do it faster. 2) is also valid. Could you reuse the logic that generates the file list in your code and use it to generate the BUILD file?
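To make workaround 1) concrete, here is a minimal shell sketch of the archive-and-unpack wrapper, using tar as suggested above. All the names here (gen_out/, app.runfiles.tar, the dummy data files) are illustrative, not Bazel conventions:

```shell
set -eu

# Build time: stand in for the generator, then archive its output
# directory into a single runfile.
mkdir -p gen_out
echo "hello" > gen_out/data1.txt
echo "world" > gen_out/data2.txt
tar -cf app.runfiles.tar -C gen_out .

# Launch time: the wrapper unpacks into a temp dir, then would exec
# the real binary with RUNDIR as its data root.
RUNDIR="$(mktemp -d)"
tar -xf app.runfiles.tar -C "$RUNDIR"
ls "$RUNDIR"
```

tar avoids zip's per-file compression overhead, but the launch-time cost still scales with the number and size of files, which is the complaint in 1).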
Is this a bug in Bazel that can be fixed? In particular, can you explain what's wrong with option 3 and how/whether dependency checking directories could be done soundly?
I've seen the documentation that says, "dependency checking of directories is unsound," but I don't get it. Verifying the integrity of a zip file containing a tree of files is just as sound as verifying the integrity of the files unextracted. It's admittedly slower to stat/checksum a bunch of files than it is to stat/checksum a single zip file, but it's equally sound, right?
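The soundness claim can be made concrete: a deterministic walk that folds every relative path and file body into one digest is exactly as trustworthy as hashing a single archive of the same tree; only the I/O cost differs. A minimal Python sketch (the name tree_digest is mine, not anything from Bazel):

```python
import hashlib
import os

def tree_digest(root):
    """Deterministically hash a directory tree: visit entries in sorted
    order and fold each relative path and file's contents into one digest.
    Two trees with identical paths and contents produce identical digests;
    any changed, added, or removed file changes the digest."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix traversal order in place
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()
```

This ignores symlinks and permissions for brevity, but the point stands: per-file checksumming of a tree is no less sound than checksumming one zip of it.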
As for Option 2, that's a bad option because it means whenever anyone changes the source file, we'll have to do a manual pre-Bazel step to generate the BUILD file; Bazel can't generate its own BUILD files until after the loading phase.
You are correct about how 2 would have to work. You could add a check that the number of outputs matches the expected number, so that while people still have to re-run a script to regenerate the BUILD file, Bazel will at least catch an incomplete output list.
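As a sketch of that idea, the pre-Bazel script could reuse the generator's file list to render the BUILD file and fail loudly on a count mismatch. Everything here (emit_build_file, the generated_data target name) is hypothetical, not an established Bazel helper:

```python
def emit_build_file(output_names, expected_count):
    """Render a BUILD file declaring every generated output explicitly.
    Raises if the generator's file list doesn't match the expected count,
    so a stale or incomplete list is caught before Bazel ever loads it."""
    if len(output_names) != expected_count:
        raise ValueError(
            "generator produced %d outputs, expected %d; re-run the "
            "BUILD generator" % (len(output_names), expected_count))
    lines = [
        "filegroup(",
        '    name = "generated_data",',
        "    srcs = [",
    ]
    # Sort for a stable BUILD file, so re-runs don't churn the diff.
    for name in sorted(output_names):
        lines.append('        "%s",' % name)
    lines += ["    ],", ")"]
    return "\n".join(lines) + "\n"
```

The check only catches a wrong count, not a renamed file, but combined with Bazel's own missing-input errors it closes most of the gap.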
One of the Bazel devs will have to weigh in to get you more clarity, but my understanding is that Bazel treats each file as a target that can be referenced elsewhere. That's pretty well baked into Bazel, so any changes to that behavior would take a lot of work. You would have to convince them that there is enough value for this to be worth the work, and even then, it would be a while before they would have time. Using a directory also isn't a well-tested mechanism and likely won't be. That puts you on a lightly tested code path, which should also be a consideration.
We are already working on allowing actions to generate a set of output files which is subsequently treated as a unit. As Austin said, it's a significant change, and it will be a while before it's fully working and stable.
Any updates on this? I'm trying to generate API clients for various languages from a single Swagger spec and don't know ahead of time which files will be generated.
You can create a directory output in a Skylark rule with ctx.actions.declare_directory; see:
https://docs.bazel.build/versions/master/skylark/lib/actions.html#declare_directory
I can't say off the top of my head how to consume the directory from a downstream action.
You can consume it by declaring a dependency on the return value from declare_directory. That'll make the directory available in the downstream action. I think it may not work with remote execution yet.
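Putting those two comments together, here is a minimal Starlark sketch of a rule that declares a directory output which downstream targets can depend on. It only runs inside Bazel, so treat it as illustration; the //tools:generator label and attribute names are hypothetical:

```python
# Starlark sketch: a generator rule whose single declared output is a
# directory (tree artifact). Downstream rules that list this target in
# their inputs receive the directory and its contents.

def _generate_impl(ctx):
    # Declare the whole directory as one output; the action may write
    # any number of files underneath it.
    out_dir = ctx.actions.declare_directory(ctx.label.name + "_out")
    ctx.actions.run(
        outputs = [out_dir],
        inputs = [ctx.file.src],
        executable = ctx.executable._generator,
        arguments = [ctx.file.src.path, out_dir.path],
    )
    return [DefaultInfo(files = depset([out_dir]))]

generate_data = rule(
    implementation = _generate_impl,
    attrs = {
        "src": attr.label(allow_single_file = True),
        "_generator": attr.label(
            default = "//tools:generator",  # hypothetical tool target
            executable = True,
            cfg = "exec",
        ),
    },
)
```

A downstream genrule or rule that lists a generate_data target among its sources should then see the expanded tree in its action's inputs, which is the consumption pattern described above.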
Is there a fix for getting declare_directory to work remotely? Or an issue to follow?