Bazel: Allow any characters in filenames

Created on 13 Aug 2015 · 67 Comments · Source: bazelbuild/bazel

Ultimately any character can be part of a filename. We should probably allow that.

Some mangling to generate the corresponding label should probably be done.

Original report on the mailing-list:
https://groups.google.com/d/msgid/bazel-discuss/CAN0GiO3__5jXo5rZqroSj0mFxpqCzUZZVkY%3DSNsJK1%2BZ1BdJLg%40mail.gmail.com

P2 misc > misc team-Bazel feature request

All 67 comments

  1. So are we talking Unicode?
  2. Where does _any character_ stop?
  3. How do you treat characters that are not allowed on certain platforms?
  4. When using mangling, how do you handle collisions?

In POSIX, filenames are "bags of bytes"--there is no encoding; however, NUL and / are not allowed. Windows has a few more restrictions. Perhaps the BUILD file should be parsed in the encoding of the system locale, usually UTF-8, and filenames run through a ValidForCurrentPlatform() function which checks for disallowed characters. However, opting for strict platform neutrality in this way means that Bazel would have to represent filenames as a bag of bytes and not a Unicode string, as there is no guarantee that the filename will roundtrip through Unicode correctly. The problem can probably be simplified by restricting filenames to be UTF-8 or UTF-16, which should cover most people's needs even though that's not strict POSIX.
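A minimal sketch of the check suggested above, in Python. The function names and the forbidden-character sets are assumptions made for illustration; they are not Bazel code, and real platform rules have more cases (for example, reserved device names on Windows).

```python
import sys

# Hypothetical rule sets, for illustration only.
POSIX_FORBIDDEN = {"/", "\0"}
WINDOWS_FORBIDDEN = POSIX_FORBIDDEN | set('\\?*:|"<>')


def valid_for_platform(name: str, platform: str = sys.platform) -> bool:
    """Return True if a single path component has no forbidden characters."""
    forbidden = WINDOWS_FORBIDDEN if platform.startswith("win") else POSIX_FORBIDDEN
    return bool(name) and not any(ch in forbidden for ch in name)


def roundtrips_as_utf8(raw: bytes) -> bool:
    """Return True if raw filename bytes survive bytes -> str -> bytes."""
    try:
        return raw.decode("utf-8").encode("utf-8") == raw
    except UnicodeDecodeError:
        return False
```

The second helper shows why a "bag of bytes" cannot always be represented as a Unicode string: some byte sequences are simply not valid UTF-8.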

Well, I think we can probably require valid UTF-8 file names and strongly recommend that people use UTF-8 for their file system. For labels / BUILD files, we probably need an escaping scheme, at least for the control characters. If there's a file that isn't valid UTF-8, we give an error message?

Our company codes mainly in C++, but our frontend uses a lot of JS and nodejs modules which have all sorts of characters in the filenames--for example, -, #, @, (, and ).

Right now this is a major blocker for getting all our codebase under one build system since we can't reference files with semi-special characters. I don't think Bazel should decide what characters are acceptable in file names, as that reduces file names to those that fit both (1) supported languages and (2) supported platforms. This seems unnecessarily restrictive, and is becoming a major pain point for us.

Agreed. Unfortunately, it's a bit tricky to fix, as a lot of code assumes that the mapping from labels to file names (and vice versa) is trivial, and doesn't require escaping. Any suggestions on an escaping scheme?

URL based?

You mean an own URI scheme? Sounds good.

I mean replacing special characters with %XX, where XX is the hexadecimal value of each UTF-8 byte.
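For illustration, this is what URL-style escaping looks like using Python's standard `urllib.parse`. The helper names are hypothetical, and keeping `/` unescaped is an assumption; the thread does not settle which characters would count as "special".

```python
from urllib.parse import quote, unquote


def escape_name(name: str) -> str:
    # safe="/" keeps path separators readable; every other character outside
    # the unreserved set is replaced by %XX for each of its UTF-8 bytes.
    return quote(name, safe="/")


def unescape_name(escaped: str) -> str:
    # Reverse the escaping, decoding the %XX bytes as UTF-8.
    return unquote(escaped)


print(escape_name("${ServiceName}.java"))    # %24%7BServiceName%7D.java
print(escape_name("@types/foo/index.d.ts"))  # %40types/foo/index.d.ts
```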

Sorry, I won't be able to work on this. @philwo had an interest, maybe he can make some progress here. :-/

This blocks our Bazel deployment as well.

This is blocking us. We have a templating system where we need to build our template files. The filenames themselves contain template variables (e.g. ${ServiceName}.java ). None of $, {, and } is supported by Bazel in file names.

I totally agree that this is important and should be done, and I want it myself. However, I don't have the time to work on it in the coming months, so I have to unassign it.

Here is my proposal:

  • Metacharacters (:, %, =, any others) must be %-encoded
  • All other characters are allowed. This includes Unicode characters.
  • Non-UTF8 names are not allowed, even if escaped. This is because there is no good way to handle them cross-platform, nor to display them to the user.
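The proposal above could be sketched as follows. The helper name is hypothetical, and the metacharacter set contains only the three characters the proposal names, since "any others" is left open.

```python
METACHARACTERS = ":%="  # the proposal leaves "any others" open


def label_from_filename(raw: bytes) -> str:
    """Build a label fragment from raw filename bytes (hypothetical helper)."""
    try:
        name = raw.decode("utf-8")  # strict decoding rejects non-UTF-8 names
    except UnicodeDecodeError as err:
        raise ValueError(f"filename is not valid UTF-8: {raw!r}") from err
    # Only metacharacters are %-encoded; everything else passes through.
    return "".join(f"%{ord(ch):02X}" if ch in METACHARACTERS else ch for ch in name)
```

Unlike full URL escaping, this keeps characters such as `$`, `@` and non-ASCII letters readable in the label.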

Plain ASCII (and even that only partially) makes this feel like we are in the early 90s.

There are reasonable ways to handle that.
For POSIX using the default locale would be good enough.

If my project is C/C++, and it is cross-platform, and I have problems handling Unicode, then I will not use Unicode in file names. And the fact that bazel "explodes" is not such a problem.
But if I use something like Java, then "it just works", and bazel would work too.

Even better would be to allow for a character-set option in the project file.
This is what maven does. And what Java does with -Dfile.encoding=UTF-8
So if it is there, use it. If not, then take the system charset.

I did not move one project to bazel because test units check that Unicode file names work.
So the files are there, make it through git, work with maven and ant and gradle and java.
But bazel fails because there is an "@" in the file name... Which is supported on all OSes.

Missing support for "$" is really painful, as this char is a valid one in the Java file name, yet we are getting this:

ERROR: /home/davido/projects/prolog-cafe/BUILD:51:1: //:builtin: invalid label 'java/com/googlecode/prolog_cafe/builtin/PRED_$atom_type0_2.java' in element 0 of attribute 'srcs' in 'java_library' rule: invalid target name 'java/com/googlecode/prolog_cafe/builtin/PRED_$atom_type0_2.java': target names may not contain '$'.

The Buck build tool supports this out of the box, and it really sucks to have to go the srcjar zip route in Bazel for this case.

FTR this has become high priority and is now my main task. But it might take a long time before it is the default (because we have to cover all corner cases).

//CC @dborowitz @spearce

Given this major drawback of the current Bazel implementation, I'm holding off on migrating the last Gerrit Code Review sub-project, prolog-cafe, from Buck to Bazel until it is fixed. Thoughts?

When we use Prolog Cafe from blaze, we get around this limitation by using a Python script to build a srcjar (zip file) containing all the source files outputted by the Prolog Cafe compiler. This srcjar is a valid input to a java_library rule. I can share this Python script downstream in our Prolog Cafe fork.

Not directly related to this bug, but another issue we ran into with blaze is it doesn't like having multiple top-level classes in a single .java file; I forget exactly what happens, but it either fails fast or messes with caching. The same Python script that creates the srcjar also takes care of splitting up top-level classes into their own .java files.

Update: https://docs.google.com/document/d/1ducs753wqYoE6ibxYYwnVlX7RSw7YQiVRNiMz-hgYQE/ is the draft design doc. The quoting is subject to change but the rest should be pretty stable.

This also affects Angular / TypeScript using Bazel.

We install packages from NPM to run in nodejs. NPM uses the @ character to mark a "scoped package", e.g. @types. When installing external dependencies with my repository_rule, the packages are installed, and I write a BUILD file into node_modules to allow depending on the files:

project/
  node_modules/
    BUILD
    @types/
      foo/index.d.ts

However, I cannot make a label in this location that points to @types/foo/index.d.ts because it gets interpreted as a workspace name: invalid label '@types/foo/index.d.ts' in element 5 of attribute 'srcs' in 'filegroup' rule: invalid repository name '@types/foo/index.d.ts': workspace names may contain only A-Z, a-z, 0-9, '-', '_' and '.'.

The workaround for me is to put the BUILD file one level higher, but this is worse because that location should be controlled by the user, not my repository rule.

'@' is not allowed in the package name. Support can probably be added immediately (you would only need to add a leading colon to avoid confusion with a remote repository).

@damienmg is that something we could get in now? It would be really helpful, my workaround is not that clean. If it's useful and you point me to the right spot I can make the contribution.

Almost two years after this bug was created, is there any progress here?

As a compromise: the most annoying limitation is the lack of support for the "$" character in the Java ecosystem, as this is a valid character in class names. My suggestion would be to forget all other characters but hard-code support for "$" in target names in Bazel, and deliver that feature in the next 2 months rather than all characters in the next 20 years.

To overcome this limitation we are forced to "hide" the file names from Bazel now: [1].

Buck version:

java_library(
  name = 'builtin',
  srcs = glob([SRC + 'builtin/*.java'], exclude = REPL + IO) +
  [
    ':builtin_srcs',
    ':system_srcs',
  ],
  deps = [':lang'],
)

Bazel version:

# TODO(davido): Fix that mess when this major Bazel bug is fixed:
# https://github.com/bazelbuild/bazel/issues/374
# That's why I left the original :builtin rule from Buck, so that
# you can feel my pain, to emulate the glob with zip, to hide
# the files that contain '$' from Bazel.
genrule(
    name = "builtin_srcjar",
    outs = ["builtin.srcjar"],
    cmd = " && ".join([
        "TMP=$$(mktemp -d || mktemp -d -t bazel-tmp)",
        "ROOT=$$PWD",
        "cd java",
        "zip -q $$ROOT/$@ com/googlecode/prolog_cafe/builtin/*.java",
        "zip -qd $$ROOT/$@ com/googlecode/prolog_cafe/builtin/PRED_\$$write_toString_2.java %s" % " ".join([s[5:] for s in IO]),
    ]),
    local = 1,
)

java_library(
    name = "builtin",
    srcs = [
        ":builtin_srcjar",
        ":builtin_srcs",
        ":system_srcs",
    ],
    deps = [":lang"],
)

What a mess?!

@damienmg any update?

Sorry I haven't got time to work on this recently, I have been on holiday and working from China right now. I need to do further investigation to allow @.

Note that , $, ( and ) are now allowed in globs and filegroup, though you might hit other problems along the way. It works fine for creating a jar that takes those as arguments.

We still have to consolidate the design to be able to move forward with allowing any characters in filenames.

Note that , $, ( and ) are now allowed in globs and filegroup though

Thanks. It worked here.

Has this change been released yet? I don't like relying on a custom built bazel if possible.

Yes

Yes

@damienmg Can you mention in what Bazel release support for $ character was actually added? I was using 0.5.2rc2 and it worked here (https://github.com/GerritCodeReview/prolog-cafe/commit/9eaffdc036c405d0199d3e2a441896541b6e7c1c).

I believe it has been there since 0.4.5, but you'll have to check the release notes to see which one exactly.

My company is pushing R&D teams to use Bazel. We have a lot of projects with files whose names contain non-ASCII characters, so supporting non-ASCII names is very important for us. We have a workaround: remove the restriction and convert filenames into the locale format. We hope this feature can be supported in the near future.

Can you send a PR to discuss the possibility of accepting it? As far as I understand, our only problem with those characters is actually printing them; there is no reason not to allow them. Add @philwo and me to the review.

@ddwolf I agree with Damien - it would be great if you could share your workaround as a pull-request, just to see what kind of changes are required to make it work for you.

I think we're currently not dealing with character encoding in a well defined way in Bazel - some code knows about the "latin-1 tunnel" to deal with unknown characters, other code treats everything as UTF-8, ... but I'm happy to audit it and see how we can clean it up.
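The thread mentions the "latin-1 tunnel" without explaining it. The sketch below shows the general technique the name usually refers to (carrying raw bytes inside a string, one code point per byte); it is an illustration under that assumption, not Bazel's code, and the function names are hypothetical.

```python
def tunnel(raw: bytes) -> str:
    """Carry arbitrary bytes inside a str, one code point per byte."""
    return raw.decode("latin-1")  # never fails: all 256 byte values map


def untunnel(text: str) -> bytes:
    """Recover the original bytes from a tunneled str."""
    return text.encode("latin-1")


raw = "é.txt".encode("utf-8")     # b'\xc3\xa9.txt'
tunneled = tunnel(raw)            # looks like 'Ã©.txt' if printed as-is
assert untunnel(tunneled) == raw  # lossless round trip
assert untunnel(tunneled).decode("utf-8") == "é.txt"
```

The round trip is lossless, but any code that treats the tunneled string as real text (for example, when printing it) shows mojibake, which is the kind of inconsistency an audit would look for.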

@damienmg @philwo Thank you for your replies. Jianfeng Yu will open a pull request shortly. Have you considered introducing escape characters so that label names can support any character?

Yes, this is the long-term solution, but it takes time to agree on how exactly to do it (and there are always other fires to fight).

Ok we just had a discussion with @philwo in which we agreed on how to move forward (Philip, correct me if I missed something):

  • We are going to reserve '%' for quoting in labels. '%%' is going to be used for a percent sign in the label and '%:' is going to be used for a colon in the label. All other characters will stay unquoted for now. We might add more quoting possibilities if a new use case requires it.
  • We are going to allow all characters in filenames, including non-ASCII, right now, and add the test cases we already had in this doc plus any other test case we encounter. All tests will be extensible with characters / strings to inject into the label, to make sure we can add any filename.
  • We are going to fix any encoding issue we encounter once we have more test cases. We want to use the locale to read the filename from the file system. Maybe we will need to add a special method to delete files whose names are not allowed by the locale encoding. We might want to mess around with the locale in our test cases. [1]

[1] If we have a filename with weird characters, we need to write it with shell quoting in our tests (e.g. xF0x0BxA1x2), so maybe we don't need to mess around with locale.
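A minimal sketch of the '%' quoting rule from the first bullet, assuming only the two escapes named there ('%%' and '%:'). The function names are hypothetical and this is not the implementation that was planned for Bazel.

```python
def quote_label(name: str) -> str:
    # Escape '%' first so that the '%' introduced for ':' is not doubled.
    return name.replace("%", "%%").replace(":", "%:")


def unquote_label(quoted: str) -> str:
    out = []
    i = 0
    while i < len(quoted):
        ch = quoted[i]
        if ch == "%":
            # A '%' must be followed by '%' or ':'; anything else is an error.
            if i + 1 >= len(quoted) or quoted[i + 1] not in "%:":
                raise ValueError(f"bad escape at offset {i} in {quoted!r}")
            out.append(quoted[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(quote_label("Kokkos_Experimental::ROCm.cpp"))  # Kokkos_Experimental%:%:ROCm.cpp
```

Because every other character stays unquoted, existing labels without '%' or ':' in the target name would be unaffected by such a scheme.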

Also, we are going to track the encoding work on this bug too, so I am adding philwo as assignee because he also wants to do some work to support it.

Joining the group here. We are affected by the % character being disallowed. We are using an external http archive, so can not change anything in the files or create zips of files.

Just found this bug. @damienmg could you update the milestone on this with a current estimate? Thanks!

Milestones are no longer relevant. Nobody's actually working on that at the moment AFAICT

Landing here from #167.

Trying bazel for the first time today. My code lives in ~/Dropbox (Personal)/foobar and bazel refuses to build it because the workspace's path has a space in it. I tried creating a symlink to ~/foobar but that didn't help. All I get is this:

ERROR: bazel does not currently work properly from paths containing spaces (com.google.devtools.build.lib.runtime.BlazeWorkspace@24eea938).

This is tricky because bazel doesn't support spaces in paths and dropbox won't let me rename that directory. :)

@fiorix I'm painted into the same corner. I need to develop an Adobe plug-in inside ~/Library/Application Support, but Bazel chokes on the space. I'm going to poke at the Bazel source and see if I can find an easy way to suppress that error.

Looks like the no spaces error was introduced in https://github.com/bazelbuild/bazel/commit/1373653f6d4903963abdd5daceabf3193fa240f4 by @kchodorow 3 years ago. Based on the more recent comments on this issue, I'm presuming that this error is obsolete and can now safely be removed.

I don't have my machine set up to build Bazel now, but I might take a stab at removing this error and seeing what breaks. 😃

There are a couple things in Bazel that may not be whitespace-safe.

Bazel generates runfiles and fileset manifests in a text format which uses whitespace as a separator. There are also a few generated wrapper scripts, especially for java_binary and py_binary, as well as a test wrapper script (test-setup.sh), which may not be whitespace-safe. I don't expect any problems with the Java code as such.
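To illustrate why a whitespace-separated manifest is fragile, here is a small sketch. It assumes a simplified two-column line format ("link target", separated by one space) and a naive parser written for this example; the actual manifest writer and its readers may differ in detail.

```python
from typing import Tuple


def parse_manifest_line(line: str) -> Tuple[str, str]:
    """Split 'runfiles_path target_path' on the first space (naive parser)."""
    link, _, target = line.rstrip("\n").partition(" ")
    return link, target


ok = parse_manifest_line("pkg/tool.sh /abs/out/pkg/tool.sh")
# ok == ('pkg/tool.sh', '/abs/out/pkg/tool.sh')

bad = parse_manifest_line("pkg/script (dev).tmpl /abs/out/pkg/script (dev).tmpl")
# bad == ('pkg/script', '(dev).tmpl /abs/out/pkg/script (dev).tmpl'): wrong split
```

With a space in the filename there is no way to tell, from the line alone, where the first column ends, so either the format needs escaping or a different separator.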

I started a change that allows more characters in labels, but I think it still excludes whitespace - though note that that is separate from whitespace in the path to the workspace itself.

Thanks @ulfjack. A few things I'm not clear on:

  • Are you still actively working on this? Seems that this bug has passed through a handful of people (Kristina, Damien) who no longer work on Bazel; it may need an owner.
  • Do you expect it's currently safe to use spaces in paths?
  • What would need to change to get an LGTM removing the assertion that a path doesn't contain spaces?

I am not currently working on this.

By all means, give it a try. If we audit (and, if necessary, fix) the places mentioned above, and maybe add an integration test, and it works, that's good enough for me. There may be hidden bugs, but we won't find them if nobody is trying it.

I ran into this problem and it was difficult to debug.

Process exited with status 1: Process exited with status 1 was the only error output.

ERROR: /project/tools/protobuf/BUILD:11:1: Creating runfiles tree bazel-out/k8-fastbuild/bin/tools/protobuf/pbts.runfiles failed: Process exited with status 1: Process exited with status 1
Target //tools/protobuf:pbts failed to build
Use --verbose_failures to see the command lines of failed build steps.

--verbose_failures gave no information. No other options I could find gave any extra output about the error. I finally determined that it was a new node_modules dependency* that had been added and then within the runfiles MANIFEST.tmp file, the last line was a file with spaces in the filename. If I removed the files with spaces, then the problem was corrected.

Is there some other debug or verbose flag that would give better information about this failure?

I think 0.16 gives a better error message. I just ran into the same error message a few days ago, and a colleague got a proper reason after updating bazel on his machine. @skinner

I tested with 0.16 also before reporting the info.

Echoing @kellycampbell: it is not easy to figure out that bazel is unhappy with whitespace in the file names of python's setuptools.

Luckily those files are safe to ignore.

py_runtime(
    name = "python3",
    files = glob(
        ["dist/**"],
        exclude = [
            # These bad boys have spaces in their name, and this kills the bazel.
            # https://github.com/bazelbuild/bazel/issues/374
            # https://github.com/pypa/setuptools/issues/746
            "dist/lib/python3.6/site-packages/setuptools/command/launcher manifest.xml",
            "dist/lib/python3.6/site-packages/setuptools/script (dev).tmpl",
        ],
    ),
    interpreter = "dist/bin/python3",
    visibility = ["//visibility:public"],
)

Is this still a problem with Bazel 0.17.2?

Asking for "any" character is perhaps making this feature request more complex to implement than is really required for most users.

I think it would be reasonable, from a perspective of maintaining the cross-platform objectives of bazel, to simply allow the full set of ascii characters which are allowed in both Windows and Posix file names (that is, everything except /, \, ?, %, *, :, |, ", <, >). That would make it much easier to integrate with third-party code which uses such characters in file names. In particular +, ( and ) are common everywhere, and npm frequently uses @. Spaces are a bit tricky because they need to be escaped correctly when building command lines, perhaps, but nothing about it seems insurmountable.

_"full set of ascii characters which are allowed in both Windows and Posix file names"_

Why _"full set of ascii characters ... allowed"_ and not _"full set of characters ... allowed"_ ?
Why ASCII only? Windows and POSIX allow for Unicode characters, for a very-very long time.

Since we need a way to escape spaces, and figure out all the places we need to patch to make it work, then we can just escape Unicode. It's not much more work. It is not the escaping that is hard, it is

If this were someone's toy project, whatever.
But if the intention is to be adopted and used, then let's make it be the best that it can be!

It would be a shame to keep it ASCII, this is (almost) 2019, not 1990s :-)

Note that this will break rules_go when Go 1.12 is out due to the presence of tests with utf8 chars in the file:
https://github.com/golang/go/tree/master/test/fixedbugs/issue27836.dir

@philwo Is this really your problem, or do you want to send it to others?

@aiuto Thanks for checking in. Yep, I'm the owner of this and fine with it. I also have ideas how to fix and test this, but so far other issues were more pressing. :|

cc @alandonovan

Somewhat related: C++ SG15 has been discussing how to name files in JSON, and current thinking is use JSON UTF-8 for "bags of bytes" that roundtrip through UTF-8 (with no normalization allowed) and arrays of integers for everything else. https://wg21.link/P1689.
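The representation described above can be sketched as follows: a JSON string when the filename bytes are valid UTF-8, and an array of integers otherwise. This only illustrates the idea as summarized in the comment; the function names are hypothetical and the actual P1689 schema is not reproduced here.

```python
import json


def filename_to_json_value(raw: bytes):
    """Return a str if raw is valid UTF-8, else a list of byte values."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return list(raw)


def json_value_to_filename(value) -> bytes:
    """Invert filename_to_json_value without any normalization."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return bytes(value)


print(json.dumps(filename_to_json_value(b"\xff\xfe.cpp")))  # [255, 254, 46, 99, 112, 112]
```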

I have a change to Blaze (Googlers: cr/263443121) that makes it use UTF-8 for all external text strings, and UTF-16 for all internal strings. If all goes well it should land this week.

@alandonovan What's the current status on your patch? Has it landed in the public repo yet?

I've run into this problem and been able to work around it by patching Bazel to use new java.io.File(new java.net.URI(...)) in some places, since URIs contain unambiguous octets instead of codepoints. But it'd be nice to have a real solution in place.

I think changing this is more difficult than expected, and will not be feasible in a single change.

The review raised many good questions and the task fell onto my back burner. Some were minor errors; some were performance concerns that require benchmarking; and some related to possible behavior changes and the need for a three-phase transition with a flag. I'm still optimistic that the behavior change is quite minor and can be fairly described as a set of bug fixes, but I have yet to see how the change affects the 'federation' set of tests, which I think should determine the approach.

This means we still can't use Bazel to build boost 1.70.0 per #8108

This also fails for Trilinos: @trilinos//:all: invalid label 'packages/kokkos/core/src/eti/ROCm/Kokkos_Experimental::ROCm_ViewCopyETIInst_int64_t_double_LayoutLeft_Rank1.cpp' in element 14026 of attribute 'srcs' in 'filegroup' rule: invalid target name 'packages/kokkos/core/src/eti/ROCm/Kokkos_Experimental::ROCm_ViewCopyETIInst_int64_t_double_LayoutLeft_Rank1.cpp': target names may not contain ':'

Just adding another example of an issue this causes.

I'm trying to see if I can download Chromium as an http_archive for a Bazel target for our end to end tests. However, at least on mac, the Chromium app package contains directories and files with spaces in the name (e.g. 'Chromium Framework.framework').

I could create a glob that specifies down to the correct folder, but then if I update Chromium versions and the folder structure changes, my target will break. I haven't been able to find a workaround yet.

I would like to offer another example of an issue this causes:

I'm trying to apply a filegroup to the output created by the Unity game engine. They use some files with spaces whose names are hardcoded into the engine. I cannot change them without breaking the engine. I have a hack to get around this for now but it is unfortunate that bazel is incapable of tracking an essential file.

What I ended up doing is adding a Unity editor plugin that, after building, copies the file(s) to the same place with spaces replaced by underscores. Then, before running, I have a script in bazel symlink them back to their original names so they can be found by the game engine. I'm sure there is a better way, but if a file could have spaces I wouldn't need any hacks.
