In gcc & clang, the flag -fdebug-prefix-map=old=new allows changing the prefix of source files referred to in debug information.
Related to #34902, this allows one to avoid having the particular source directory that a file was built in affect the output object/executable/library contents.
It also allows source-level debugging to work in cases where the source code is installed by the debug packages to a location that differs from where it was built (debian & OpenEmbedded, at least, take advantage of this).
As an alternative, the program debugedit allows modifying the files after generation to adjust the paths (as far as I can tell, Fedora uses this, and potentially other rpm based distros).
cc @michaelwoerister
Thanks for writing up the issue, @jmesmon!
@rust-lang/tools: Can any of you think of a reason not to add something like this as a -C flag?
Sounds good to me!
A -C flag is good, but sometimes we have seen that certain buildsystems or individual projects like to save compiler flags (i.e. -fdebug-prefix-map=<PATH>=.) into other parts of the overall build output, thereby making it depend on the build path, even if rustc itself supports this -C flag. But it's reasonable to assume that compiler flags will affect the output, and save this somewhere else for auditing.
Because of this, we recently submitted some patches to GCC to also support the same behaviour as debug-prefix-map, except via an environment variable that explicitly should not be saved to any build output. The patches are still pending, but I haven't yet received any significant negative comments on it, and I'll be pinging GCC again soon about it. It would be good if rustc could also support this same environment variable in the future.
Actually, I didn't know about debugedit before, thanks for bringing that up! It might allow us to normalise files outside of GCC or independently of any other compiler, I will have to look into that.
I implemented this and it seems to work. I tested by building src/test/run-make/reproducible-build in two different directories. With -C debug-prefix-map=`pwd`=. the output is reproducible; without it, it isn't.
(reproducible-build test itself only tests stable symbol naming, not bit-for-bit output. In the past Rust could produce different symbol names between runs(!): see #30330.)
Unclear if it's important for rust, but in gcc/clang multiple mappings are supported. This ends up being important in C/C++ due to #include pulling in files from different directories. If something similar can happen in rust allowing multiple maps would be useful there too.
Also, it could be a good idea to avoid the splitting on = rather than copying gcc's interface to allow = to be included in the old path (though including = in a path is unlikely, it'd be a good idea to avoid leaving behind that landmine if possible).
How do multiple mappings work? Are they applied in command line order? Then is the order significant? I think it must be, since a/b=c a=d applied to a/b results in c, but a=d a/b=c applied to a/b results in d/b.
In gcc, the handling of multiple debug_prefix_maps is to search the mappings last-on-cmdline to first-on-cmdline, applying the first prefix that matches. The last-to-first strategy is common in gcc (and other command line tools) as it is intended that later options are able to override earlier ones.
My GCC patches linked above include modifying the existing GCC behaviour to split on the final = instead of the initial one. I think that is better as well: I can imagine someone wanting to map a path that contains a =, but it is less likely that someone would map something to such a path.
(edit: previously mentioned a space character, that was for some other thing that I got confused with)
Thanks for answers! I will implement multiple mappings, last-to-first order, and splitting on the final = now.
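A minimal sketch of the behaviour agreed on above, treating mappings as plain string prefixes: split each OLD=NEW argument on the final =, then search the mappings last-to-first and apply the first prefix that matches. The function names `parse_mapping` and `remap` are hypothetical and are not taken from rustc or GCC.

```rust
/// Split an `OLD=NEW` argument on the *final* `=`, so that OLD may itself
/// contain `=` (NEW then cannot). Returns None if there is no `=` at all.
fn parse_mapping(arg: &str) -> Option<(String, String)> {
    let (old, new) = arg.rsplit_once('=')?;
    Some((old.to_string(), new.to_string()))
}

/// Apply the mappings as string prefixes, searching from the last mapping
/// given on the command line to the first, and stopping at the first match.
fn remap(path: &str, mappings: &[(String, String)]) -> String {
    for (old, new) in mappings.iter().rev() {
        if let Some(rest) = path.strip_prefix(old.as_str()) {
            return format!("{}{}", new, rest);
        }
    }
    // No mapping matched: the path is left unchanged.
    path.to_string()
}
```

With last-to-first search, the mappings `a/b=c a=d` turn `a/b` into `d/b`, while `a=d a/b=c` turn it into `c`, so the order remains significant.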
I'd recommend avoiding splitting on = entirely. Is the cmdline interface of rustc flexible enough to handle either having an argument take 2 parameters or allow 2 flags to work together to implement the same thing? (for example, -C debug_prefix_old=foo -C debug_prefix_new=bar and enforcing ordering + pairing).
It would be a really good idea to keep all of the escaping/special characters in callers of rustc (shells, etc) just to avoid funny limitations like this (= being special).
How about requiring the mapping information to be provided in a file (similar to ld's --version-script for example)? Would that be too clunky?
But it seems to me that this is a feature that's only used in specialized settings, so that would seem fine to me.
I'd prefer to avoid needing (temporary?) external files to configure this feature. I say temporary because debug src mappings aren't like a target specification, which is a fixed, predetermined value for all platforms: they depend on the build directory and on where the source is being mapped to (which, in the debug source packaging case, is typically a path under /usr/src/package-name-version, and version is potentially adjusted quite a bit).
And one would need to know the escaping in that file format, so it doesn't simplify things wrt allowing arbitrary paths, it just moves them somewhere else.
So, it seems that discussion on this issue has stalled (here and over in #38348) for two reasons:
/abc/def with xyz, what happens when I encounter /abc/./def or /abc/../abc/def/? What about relative paths?
I want to move forward with this so I propose the following solutions:
If no one has a clear, practical reason otherwise, I say we use the CLI as proposed by @infinity0 and myself: have pairs of -Zdebug-prefix-map-from=<...> and -Zdebug-prefix-map-to=<...>, that are matched up nth from to nth to. It is a bit verbose but doesn't require any additional escaping and can handle paths on all platforms.
The semantics are a bit more complicated but I propose that debuginfo paths are generally normalized to not contain . or .. components and that remapping works on absolute versions of these normalized paths. This gives predictable results. UPDATE: Prefix matching works at directory name level, not at the path-string content (see example below).
Some examples:
map: /abc/def -> /xyz
Absolute paths containing the prefix:
/abc/def/file1.rs -> /xyz/file1.rs
/abc/def/build/../file1.rs -> /xyz/file1.rs
/abc/def/./file1.rs -> /xyz/file1.rs
/abc/./def/file1.rs -> /xyz/file1.rs // would not match with gcc
/abc/def/mod1/file1.rs -> /xyz/mod1/file1.rs
/abc/def/mod1/./file1.rs -> /xyz/mod1/file1.rs
Absolute paths not containing the prefix:
/std/file1.rs -> /std/file1.rs // no change
/std/./file1.rs -> /std/file1.rs // normalization
/std/build/../file1.rs -> /std/file1.rs // normalization
Relative paths containing the prefix:
(DW_AT_comp_dir=/abc/def/build, path=../file1.rs) => /xyz/file1.rs
(DW_AT_comp_dir=/abc, path=./def/file1.rs) => /xyz/file1.rs // would not match with gcc
(DW_AT_comp_dir=/std, path=./mod1/file1.rs) => /std/mod1/file1.rs
Mapping happens at the directory name level, no partial names allowed:
/abc/def-2/file1.rs -> /abc/def-2/file1.rs
/abc/def.rs -> /abc/def.rs
The formula that produces these results is:
fn debuginfo_path(p: Path, map: [(Path, Path)]) -> Path {
    let p = normalize(make_absolute(p))
    for (from, to) in map {
        // Exit on the *first* match, order determined by commandline option order
        // UPDATE option order is last to first, i.e. later CLI options overrule earlier ones
        if p.starts_with(from) {
            return p.replace_prefix(from, to)
        }
    }
    // No remapping done, but still normalized and absolute now
    p
}
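For reference, the formula can be sketched as runnable Rust roughly as follows. This is an illustration under stated assumptions, not the compiler's implementation: the working directory is passed in as an explicit `cwd` parameter, `normalize` is a hypothetical helper that removes . and .. purely lexically (it never touches the filesystem, so symlinks are not resolved), and the mappings are searched last-to-first.

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically remove `.` and `..` components without consulting the filesystem.
fn normalize(p: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for c in p.components() {
        match c {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Make `p` absolute relative to `cwd`, normalize it, then apply the first
/// matching mapping, searching from the last mapping to the first.
fn debuginfo_path(cwd: &Path, p: &Path, map: &[(PathBuf, PathBuf)]) -> PathBuf {
    // `join` keeps `p` as-is when it is already absolute.
    let p = normalize(&cwd.join(p));
    for (from, to) in map.iter().rev() {
        // `strip_prefix` compares whole path components, so `/abc/def`
        // does not match `/abc/def-2`.
        if let Ok(rest) = p.strip_prefix(from) {
            return to.join(rest);
        }
    }
    // No remapping done, but still normalized and absolute now.
    p
}
```

Because `Path::strip_prefix` works on components, the "no partial names" rule from the examples falls out of the standard library without extra code.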
Note that whether paths are later stored as relative to their DW_AT_comp_dir again is an independent question that I don't want to discuss here.
Thoughts? @jmesmon @infinity0 @sanxiyn @jsgf @rust-lang/tools
I don't see an issue with that so long as the "commandline option order" is defined as last-to-first. Doing this bit differently from other command line utilities doesn't buy us anything (unlike having 2 separate args for from & to).
It also looks like the examples fix the mapping at directory name level, but that doesn't appear in the English description. I don't have anything in mind that would break that, but given that allowing matching partials would allow appending '/' to the end to get matching-full-path-elements, I'm not sure such a restriction is a good idea.
so long as the "commandline option order" is defined as last-to-first
If there's precedent for that I'm fine with going last-to-first.
I don't have anything in mind that would break that, but given that allowing matching partials would allow appending '/' to the end to get matching-full-path-elements, I'm not sure such a restriction is a good idea.
I think it's just simpler to use. You don't have to worry if you need to append a / to avoid accidental renamings.
If there's precedent for that I'm fine with going last-to-first.
This is the order used by gcc & clang for all of their "more than 1 & pick 1" options (ignoring special cases): debug-maps (as discussed earlier in this thread), optimization levels, debug info levels, include directories, etc.
I think it's just simpler to use. You don't have to worry if you need to append a / to avoid accidental renamings.
Sure, it's simpler. The issue is that it's also less flexible, and it's trivial to get the match-full-paths behavior from match-anything, but going the other direction is impossible. Again, I don't have a use case that would want partial matches, but the ease of supporting both should be considered.
This is the order used by gcc & clang
I updated the description above to reflect this.
Sure, it's simpler. The issue is that it's also less flexible, and it's trivial to get the match-full-paths behavior from match-anything, but going the other direction is impossible.
Yes, I know that string-based matching is more powerful. My argument is that it gives you more subtle ways to get it wrong without a clear benefit. Is it from=/abc/ to=/xyz/? from=/abc/ to=/xyz? from=/abc to=/xyz? But I don't have a strong preference. If someone says they absolutely want this, I'm fine with implementing it.
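To make the trade-off concrete, here is a small sketch contrasting the two matching styles discussed above. The helper names `string_match` and `component_match` are hypothetical; the component-level variant relies on the documented behaviour of `std::path::Path::starts_with`, which only considers whole path components.

```rust
use std::path::Path;

/// String-level matching: `from` is a plain prefix of the path text.
fn string_match(path: &str, from: &str) -> bool {
    path.starts_with(from)
}

/// Component-level matching: `from` must match whole directory names.
fn component_match(path: &str, from: &str) -> bool {
    Path::new(path).starts_with(from)
}
```

With `from=/abc/def`, string matching also matches `/abc/def-2/file1.rs`, whereas component matching does not. Appending a / (`from=/abc/def/`) makes string matching behave like component matching for that case, which is the "trivial to get" direction mentioned above.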
Several points:
I think splitting the option into two is very unpretty, and somewhat ambiguous - if they're just related by being adjacent, it seems like it raises a lot of questions:
What if from and to get separated by other options?
What if there isn't the same number of from and to options?
Can the from and to be reordered (ie, what if they appear as to from)?
In particular it means that any tool that's parsing/processing the commandline needs to know about this special case in order to avoid breaking it.
Normalizing paths by eliminating .. is dangerous if the path contains symlinks: foo/bar/../blat/lib.rs is not the same as foo/blat/lib.rs if bar is a symlink. I'm happy with matching at the directory component level (though string matching is strictly more general), but I think going beyond that is a bad idea.
Edit: I can't think of a problem with eliminating . though. Perhaps that would be useful.
Proposal:
Retain the -Zdebug-prefix-map=OLD=NEW syntax, but also add -Zdebug-prefix-map-separator=X such that subsequent (left to right on the command line) OLD and NEW mappings can be separated by X. This allows the mappings to contain any character (but not every character), and the tool generating the command line can select a separator that doesn't break the path.
I think @michaelwoerister's suggestions are cleanest; tools that want to add this value to a rustc flag probably don't want to look inside the value to search it and then select a separator.
I agree that matching only full path components is best and less prone to error. I am a little concerned that GCC differs a bit from this, but the rustc code example is indeed very simple and it might be possible to ask GCC to adopt a similar approach - I have to send them a patch anyways, I may add this as well. Even if they don't adopt it, we (in the interests of standardising this behaviour) could define a standard that defines a "minimal" mapping behaviour but leave it open saying "the tool might perform further additional mappings".
I am less sure about normalisation, because it has the potential to mess with the semantics of various fields. For example this:
~~~
(DW_AT_comp_dir=/abc, path=./def/file.rs) => /xyz/file1.rs // would not match with gcc
~~~
what would you map DW_AT_comp_dir to? / or .? It's unclear, and it messes with the semantics, which is supposed to be "the working directory of the build command". There could be other fields that depend on the original meaning.
There are two cases:
/abc, in which case both fields are reproducible, no need to normalise.
PWD but gave a mapping only for a specific child of the PWD. In this case, I think it's better to leave the situation unaltered so that it's at least detectable by reproducers, rather than trying to do something fancy.
(edit: "other mappings" -> "further additional mappings")
(To expand on the flags points, I think it's fine to allow from and to to be separated by other options and appear in a non-pairwise order, and in practise no build tool would actually do this but it makes the logic simpler to analyse and write code for; and if they have different numbers of arguments then just fail the compile with an error.)
@jsgf I agree with @infinity0 regarding the CLI options:
What if from and to get separated by other options?
It makes no difference to the semantics.
What if there isn't the same number of from and to options?
Compilation fails with an error.
Can the from and to be reordered (ie, what if they appear as to from)?
It makes no difference. The n-th to is matched up with the n-th from.
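The pairing rules answered above (other options in between are ignored, the n-th to pairs with the n-th from regardless of interleaving, and a count mismatch is an error) could be sketched like this. The flag spellings are the ones proposed in this thread; the function name `collect_mappings` and the error text are hypothetical.

```rust
/// Collect the proposed `-Zdebug-prefix-map-from=` / `-Zdebug-prefix-map-to=`
/// flags from an argument list and pair the n-th `from` with the n-th `to`.
fn collect_mappings(args: &[&str]) -> Result<Vec<(String, String)>, String> {
    const FROM: &str = "-Zdebug-prefix-map-from=";
    const TO: &str = "-Zdebug-prefix-map-to=";
    let mut froms = Vec::new();
    let mut tos = Vec::new();
    for arg in args {
        if let Some(value) = arg.strip_prefix(FROM) {
            froms.push(value.to_string());
        } else if let Some(value) = arg.strip_prefix(TO) {
            tos.push(value.to_string());
        }
        // Any other option is ignored here; it does not affect the pairing.
    }
    if froms.len() != tos.len() {
        return Err(format!(
            "{} from option(s) but {} to option(s)",
            froms.len(),
            tos.len()
        ));
    }
    Ok(froms.into_iter().zip(tos).collect())
}
```

Since nothing is split on a separator character, the values can contain = or any other character the caller is able to pass as an argument.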
Normalizing paths by eliminating .. is dangerous if the path contains symlinks [...]
You're right, that's a problem. An alternative would be to replace normalize(make_absolute(p)) with std::fs::canonicalize(path), which also resolves symlinks. You'd have to know where your symlinks lead to be able to write mappings but I think it's still easier to reason about than what we have today, where things are unpredictable to a large degree.
@infinity0
I am less sure about normalisation, because it has the potential to mess with the semantics of various fields. For example this:
(DW_AT_comp_dir=/abc, path=./def/file.rs) => /xyz/file1.rs // would not match with gcc
what would you map DW_AT_comp_dir to? / or .? It's unclear, and it messes with the semantics, which is supposed to be "the working directory of the build command". There could be other fields that depend on the original meaning.
DW_AT_comp_dir would stay unchanged as /abc and the path being mapped in this example would be absolute and independent of the corresponding DW_AT_comp_dir. Having a mapping like this would probably be a bug in the build system. The output would be predictable though.
Canonicalizing the path might make things worse. If you're using symlinks to normalize the namespace across multiple machines (a distributed build farm, for example), then canonicalizing the path will denormalize it. I think it would be a mistake to be too clever with the paths.
If you're using symlinks to normalize the namespace across multiple machines (a distributed build farm, for example), then canonicalizing the path will denormalize it.
Yes, that makes sense. No canonicalizing then.
Thinking about this more, I really think this is getting vastly overcomplicated.
I think the debug prefix map should be considered purely an operation on strings: the literal strings passed on the command-line to refer to the input sources (and - I guess - the current directory either from $PWD or getcwd()) with no attempt at processing, normalizing or even taking path elements into account.
This implies that the remapping should be applied before rustc makes them absolute, and rustc should only make them absolute if the output of the remapping is not already absolute (since remapping ./foo/ to /an/absolute/path should be perfectly reasonable).
I think that's a much simpler model to reason about and use. If I think about how I want to use this feature, any path processing on rustc's part makes things more complex to reason about and doesn't solve any of my problems.
Edit: By which I mean, any path processing can be done by tooling outside rustc - if I want normalized paths, I can normalize them. If I want them relative, absolute, etc, etc, I can do that outside, so long as I know that rustc isn't going to do anything complex/clever with them. The more complexity rustc applies, the fewer options I have.
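A rough sketch of the string-only model described above: remap the literal path first, and only prepend the working directory if the result is still relative. Everything here is an assumption made for illustration: the function name `remap_then_absolutize`, the Unix-style "starts with /" test for absoluteness, and the last-to-first search order.

```rust
/// Remap `path` as a plain string, then make it absolute only if the
/// remapped result is not already absolute (Unix-style check).
fn remap_then_absolutize(path: &str, cwd: &str, mappings: &[(&str, &str)]) -> String {
    let mut mapped = path.to_string();
    // Search last-to-first and stop at the first matching prefix.
    for (old, new) in mappings.iter().rev() {
        if let Some(rest) = path.strip_prefix(*old) {
            mapped = format!("{}{}", new, rest);
            break;
        }
    }
    if mapped.starts_with('/') {
        mapped
    } else {
        // No normalization: the pieces are simply concatenated.
        format!("{}/{}", cwd, mapped)
    }
}
```

Under this model a mapping from `./foo/` to `/an/absolute/path/` yields an absolute result directly, and rustc never needs to interpret the path components.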
I agree that predictability is the most important thing here. I thought that scheme I proposed above would provide that best but with corner cases cropping up, I don't think that anymore.
Let's try to come up with a different rule set:
rustc ./main.rs, I'll get ./mod1/sub.rs. If I do rustc main.rs, I get mod1/sub.rs.getcwd() + user-provided-path. So ../src/main.rs will become /home/foo/project/build/../src/main.rs, for example.DW_AT_comp_dir=remap(pwd), and DW_AT_decl_file=remap(abs_path).How does that sound?
(cc @luser, who might also be interested in this whole topic)
I think that the mappings should be applied last, after any other processing such as converting to absolute paths. I'd imagine that this would be more predictable, at least from an outsider that is merely observing what rustc does without reading its source code - in other words, one could build both with and without the mappings, and the output would be related in a way that is only based on the mappings and the algorithm, and not on anything else rustc might do now or in the future.
In (3), making all paths absolute, would mean the output can only be reproduced if rebuilders build it in the same path. Is this really necessary? I'd prefer to keep relative paths relative (as GCC does), even if they traverse above cwd. The parent tool that calls rustc could easily add extra ../src=xxx mappings if it feels that this is necessary. The one situation this wouldn't be appropriate, is if these relative paths are generated by rustc itself and are unknown to the parent tool - but does this situation actually occur?
I think that the mappings should be applied last, after any other processing such as converting to absolute paths.
That's what I suggest.
In (3), making all paths absolute, would mean the output can only be reproduced if rebuilders build it in the same path.
Or if they set up their prefix-mapping to result in the same path, right?
Is this really necessary?
It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.
The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked, so it won't know what absolute path applies at that point, and therefore can't do path remapping in those terms. It does know what paths it's putting on the commandline though, so it can generate remappings in those terms.
@michaelwoerister Good point about derived names; I'd overlooked those. (Mostly because I assume they'd have a common prefix from the perspective of mapping.)
@jsgf
The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked
How do you construct your prefix mapping then?
How do you construct your prefix mapping then?
To make it more concrete: DW_AT_comp_dir always contains the current directory. If you want to map that to something stable, you have to know its value, right?
All relative. I'm not that interested in remapping comp_dir, since there is actually no one canonical path that makes sense in my environment. I'm more interested in remapping the source paths to a relative canonical name within the source tree.
To be specific:
The net result is that not only is the current directory some random name, but the path to the sources is prefixed by a buck-generated name. This means that debugging with a cached object will result in meaningless source paths. I want to use prefix remapping to map the path to the source from the symlink tree back to the canonical location in the source tree.
Reproducible builds are of secondary interest because they'd have better caching properties; in that case remapping comp_dir to some fixed (but essentially meaningless) string would help - but only if the rest of the object were bit-for-bit identical.
(Added bonus, I'd like to use the remapped name for error messages, but we can get to that later.)
@jsgf Can you give a small example of what your mapping would look like?
It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.
If you're invoking rustc directly then this shouldn't be as much of a problem, since presumably you're not using cargo. If you're using cargo and still need reliable paths you could set CARGO_HOME to a known path and then remap that to a fixed path. We don't have this problem in Gecko (since I fixed the debug info for generics from external crates) because we've vendored all our crates into our source repo, so their source paths are always inside our top source directory.
Something like:
rustc -Zdebug-prefix-map-from=./buck-out/gen/my/build-target#pic,rlib/ -Zdebug-prefix-map-to=./ --other --options ./buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs
ie, remapping the prefix to ./, with the expectation that this will result in DW_AT_comp_dir being <some random path>, but the DW_AT_name for each source file would be ./some/path.rs.
BTW, -Z's behaviour of always emitting the "warning: the option Z is unstable and should only be used on the nightly compiler, but it is currently accepted for backwards compatibility; this will soon change, see issue #31847 for more details" warning makes it a non-starter for this.
@jsgf Couldn't you use something like the following?
rustc -Zdebug-prefix-map-from=`pwd`/buck-out/gen/my/build-target#pic,rlib/
-Zdebug-prefix-map-to=/something/
`pwd`/buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs
@michaelwoerister There's no shell involved, so no backtick substitution.
BTW, -Z's behaviour of always emitting the [...] warning ...
The plan is to move this to a stable -C flag once we are sure we want to stabilize it. In the beginning it will start out as unstable though (like all new features).
Still really annoying message. Is there -Z yes-I-know?
Is there -Z yes-I-know?
Not that I know of, unfortunately.
There's no shell involved, so no backtick substitution.
The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative (because they are relative to something different for each crate). This is different from C/C++ where the source for templates is always available when they are instantiated.
The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative
Well, in this case they're either relative to the same path (ie, from the same sourcebase), or their path is meaningless (from crates.io, which are all prebuilt).
The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked, so it won't know what absolute path applies at that point [..]
The net result is that not only is the current directory some random name, but the path to the sources is prefixed by a buck-generated name. [..]
If I understand correctly, this is an argument in favour of applying the mappings first, before any other processing? I think this is better fixed in Buck itself, because it is the one that sets up these symlinks. It should know what cwd is used for each invocation of Rust, so it should be able to construct the example maps suggested by @michaelwoerister that contain pwd even without a shell.
Also if one applies the mappings first, it likely would result in a non-existent path. Then it's unclear how you do "other processing" on this. (If there is no other processing, then "first" and "last" are the same, and we're in agreement.)
Tools that run after rustc has emitted the debuginfo have no control over what rustc does or how this might change over time. So it is more important to keep the expectations here simple. GCC also applies the remapping last (see gcc/dwarf2out.c in dwarf2out_early_finish).
I think that the mappings should be applied last, after any other processing such as converting to absolute paths.
That's what I suggest.
OK, I think I just got confused by the below point:
Is [making all paths absolute] really necessary?
It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.
Could you explain this in some more detail so I/others could think of how to "solve it differently"? For example this:
The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative
Well, in this case they're either relative to the same path (ie, from the same sourcebase), or their path is meaningless (from crates.io, which are all prebuilt).
this sort of makes sense to me, but I'm not sure what the details are.
@infinity0 Regarding relative paths versus extern crates, let's say you have the following setup:
/libfoo/src <-- contains the source of libfoo
/libfoo/build <-- this is where you build libfoo & cwd of rustc while doing so
If you have a generic function func in libfoo, its source location would be recorded as ../src/lib.rs.
Now, let's say you compile another library, libbar, that references libfoo and has a similar setup:
/libbar/src <-- contains the source of libbar
/libbar/build <-- this is where you build libbar & cwd of rustc while doing so
If you use foo::func in libbar you end up with a debuginfo entry that tells you that the source code of foo::func can be found in ../src/lib.rs, but we are relative to /libbar/build now, so the debugger would open /libbar/src/lib.rs and show you some unrelated source code. So, at least for items from external crates, in the general case we have to emit absolute paths.
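The mismatch described above can be reproduced with a few lines of Rust. This is only an illustration of the path arithmetic, assuming a debugger resolves a relative source path against the compilation directory; `normalize` and `resolve` are hypothetical helpers that work lexically and never touch the filesystem.

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically remove `.` and `..` components.
fn normalize(p: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for c in p.components() {
        match c {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Resolve a recorded (possibly relative) source path against a comp_dir.
fn resolve(comp_dir: &Path, recorded: &Path) -> PathBuf {
    normalize(&comp_dir.join(recorded))
}
```

Resolving `../src/lib.rs` against `/libfoo/build` gives `/libfoo/src/lib.rs`, but resolving the same recorded path against `/libbar/build` gives `/libbar/src/lib.rs`, which is the unrelated file the debugger would open.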
@infinity0 In principle if all the building is happening on one machine, then Buck could control everything. But if the build is being distributed then Buck on machine A could set up an environment that's consistent with relative paths, but be in a different absolute directory on machine B. The problem is that paths produced from getcwd() may be absolute, but they are not canonical. (However, if Buck specified everything as absolute paths then they could be made canonical so that the same path works in all environments via the use of symlinks or similar - so I guess it could generate debug-prefix-path options to remap the abs build-time paths to relative.)
For the same reason, I think using absolute paths for the problem @michaelwoerister mentions above is also wrong. We want canonical paths for those source files, not absolute ones. The conflation between absolute and canonical is where I see problems.
Specifically, if you're trying to reconcile the paths for a libfoo build in the context of a libbar build, then I think it's more correct to use the libfoo's comp_dir and relative source paths to construct paths relative to libbar's comp_dir than to use absolute paths.
Specifically, if you're trying to reconcile the paths for a libfoo build in the context of a libbar build, then I think it's more correct to use the libfoo's comp_dir and relative source paths to construct paths relative to libbar's comp_dir than to use absolute paths.
The weirdness here is for generics--they're not actually compiled until you instantiate them with a concrete type, so if you're using Foo<T> from libfoo it's not compiled until it gets used in libbar. What I implemented a while back was to put absolute paths in the metadata with the bytecode instead of the relative paths (which couldn't be resolved later). All it does is join the relative path with the comp_dir, so it should be functionally the same except for not having it in two separate fields.
If you have a generic function func in libfoo, its source location would be recorded as ../src/lib.rs. [and the cwd as /libfoo/build].
The weirdness here is for generics--they're not actually compiled until you instantiate them with a concrete type, so if you're using Foo<T> from libfoo it's not compiled until it gets used in libbar.
These two comments sound inconsistent - in the first one, where is the source location recorded? But the second comment suggests this is not available?
Anyway, using the example with libfoo and libbar directly above again, there are two cases:
libfoo::x is compiled with cwd=/libfoo/build and name=../src/x. Then we compile libbar. In this case what @jsgf suggested seems sensible to me, i.e. store the name of x in libbar as relpath("/libfoo/build" + "../src/x", "/libbar/build") which would be ../../libfoo/src/x.
libfoo::x is not compiled at first. Then we compile libbar. In this case, libbar has to find x.rs somehow (or some other intermediate file, I don't know) in which case it can still use a relative path as its name?
In both cases we are assuming that /libfoo and /libbar's relative positions to each other are fixed across both builds (even if their absolute paths change). If this is not the case, then at the very least libbar has to "find" libfoo somehow. Whichever directory this "find" algorithm returns, (it could be /libfoo or /libfoo/build, I don't know), call this directory d, then you could additionally store relpath(d, cwd) somewhere in the rust metadata when building libfoo, so that the later libbar could still work with relative paths. It would be slightly more complex, but still easily achievable IMO. (And none of what I described involves canonicalising symlinks, which would mess with Buck.) I think this addresses the distributed scenarios that @jsgf described too, but I'm not familiar with the details so perhaps he could confirm that.
(edit: explain why I think the two comments sound inconsistent)
(edit: to clarify, by relpath(x, y) I mean the path from y to x, like python's os.path.relpath)
libfoo::x is not compiled at first. Then we compile libbar. In this case, libbar has to find x.rs somehow (or some other intermediate file, I don't know) in which case it can still use a relative path as its name?
Sorry, I should have been clearer! The generic code gets converted into some sort of bytecode (I'm sure someone else knows the specifics here), and that's stored in the generated rlib along with some metadata (which I think is handled by librustc_metadata). If you list the contents of an rlib with ar t foo.rlib you'll see that it contains a .o file (the actually compiled bits of the crate), a rust.metadata.bin file, and a .bytecode.deflate file. The compiler gets the filename from the metadata for the items it's instantiating from bytecode.
Regardless, given that it has a full path in the metadata, I don't see why getting a relative path to the libfoo comp_dir and then joining the relative path would be better than just taking a relative path from the libbar comp_dir to the full path from libfoo. In your example, we'd have libfoo::x being compiled with relpath("/libfoo/src/x", "/libbar/build") which would still be ../../libfoo/src/x.
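For readers who want to check the relpath arithmetic used in the last few comments, here is a small lexical sketch in Rust. It assumes both inputs are absolute, already free of . and .. components, and share a root (so it does not cover cases such as different Windows drive letters); the function name mirrors Python's `os.path.relpath` but is otherwise hypothetical.

```rust
use std::path::{Component, Path, PathBuf};

/// Path from `base` to `target`, computed purely from the path components.
fn relpath(target: &Path, base: &Path) -> PathBuf {
    let t: Vec<Component<'_>> = target.components().collect();
    let b: Vec<Component<'_>> = base.components().collect();
    // Number of leading components the two paths have in common.
    let common = t.iter().zip(b.iter()).take_while(|(x, y)| x == y).count();
    let mut out = PathBuf::new();
    // Climb out of the part of `base` that is not shared...
    for _ in common..b.len() {
        out.push("..");
    }
    // ...then descend into the part of `target` that is not shared.
    for c in &t[common..] {
        out.push(c.as_os_str());
    }
    out
}
```

For example, `relpath("/libfoo/src/x", "/libbar/build")` yields `../../libfoo/src/x`, matching the value given in the comments above.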
In both cases we are assuming that /libfoo and /libbar's relative positions to each other are fixed across both builds.
This seems to be the main problem with this approach. I would be wary of positing that (1) the rule set must support situations where the compilation directory is not known, but (2) it can be assumed that relative positions never change. That seems too tailored to this specific situation to make for a good general rule. Also if there is no common prefix between two paths (as in D:\foo and X:\bar), there is no relative path between them and you have to know the original compilation directory of your upstream crate again to set up a mapping.
@jsgf How about if we introduce a variable __RUSTC_CWD to the mapping syntax? As in:
rustc -Zdebug-prefix-map-from=__RUSTC_CWD/./buck-out/gen/my/build-target#pic,rlib/
-Zdebug-prefix-map-to=/something/
./buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs
A variable like this would be guaranteed to match the prefix of any absolute path the compiler emits.
Items from upstream crates would have their paths map with the mapping given when that upstream crate was compiled originally.
Sure, using the full path to lib.rs would also work. I didn't realise that was available; I was only following the earlier constraint "it's source location would be recorded as ../src/lib.rs". As long as we do get the correct final relative paths, how it's calculated is not so important to me.
However, my overall motivation is to avoid having absolute paths anywhere in the output, both in the debugging output and in the rlibs, since they can be installed onto end-user systems. That's where the second part of my suggestion comes in, which concerns the "find" algorithm for locating other crates. Suppose we call d instead crate_dir for clarity, then here's a concrete example:
First we build libfoo with prefix map /path/to/libfoo=/usr/src/rust/libfoo
~~~
libfoo/build/foo.rlib#libfoo.o:
comp_dir = /usr/src/rust/libfoo/build, mapped from /path/to/libfoo/build
name = ../src/lib.rs
libfoo/build/foo.rlib#rust.metadata.bin:
crate_dir = /usr/src/rust/libfoo, mapped from /path/to/libfoo
fn run_foo_generically:
path = ../src/lib.rs
~~~
This is reproducible regardless of the rebuilder's build directory, as long as they set the right prefix-map.
Later on a different build machine, someone else builds libbar, using libfoo from /my/crates/libfoo, with no prefix map because they don't care about reproducibility, but they do care about correct debugging:
~~~
libbar/build/bar.rlib#libbar.o:
comp_dir = /my/own/libbar/build
name = ../src/lib.rs
fn run_foo_generically:
path = /my/crates/libfoo/src/lib.rs
^ probably this is not exactly how things work, but hopefully "similar enough" that you get the idea
~~~
And this last value /my/crates/libfoo/src/lib.rs would be calculated via:
joinpath(relpath(comp_dir, crate_dir), run_foo_generically's (rel)path (from rust.metadata.bin))
(This assumes that relpath(comp_dir, crate_dir) is reproducible and not something like ../../../lol/trolls/gonna/troll/libfoo/build.)
Actually, even in the case that someone does want to build libfoo somewhere completely random, they could build it using absolute source path names (which might get remapped). Later, anyone else such as libbar can still recover these source paths relative to the crate_dir, which is all that is needed to correctly resolve the paths on libbar's side.
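A rough Python sketch of that calculation, using the values from the libfoo example. The function name path_within_crate is hypothetical, and the final join with the locally found crate directory is one reading of the proposal rather than anything rustc does.

```python
import posixpath

def path_within_crate(comp_dir, crate_dir, recorded_path):
    """Express a recorded source path as a path relative to crate_dir."""
    if posixpath.isabs(recorded_path):
        # Crate was built with absolute source names: strip crate_dir directly.
        return posixpath.relpath(recorded_path, crate_dir)
    # Otherwise the name is relative to comp_dir, so go through comp_dir.
    rel_comp_dir = posixpath.relpath(comp_dir, crate_dir)
    return posixpath.normpath(posixpath.join(rel_comp_dir, recorded_path))

# Values stored in foo.rlib (after libfoo's prefix map was applied).
rel = path_within_crate(
    comp_dir="/usr/src/rust/libfoo/build",
    crate_dir="/usr/src/rust/libfoo",
    recorded_path="../src/lib.rs",
)
print(rel)  # src/lib.rs

# On the libbar builder, libfoo was found under /my/crates/libfoo.
print(posixpath.join("/my/crates/libfoo", rel))  # /my/crates/libfoo/src/lib.rs
```

The intermediate value src/lib.rs contains neither builder's absolute paths, which is why it could be stored in the rlib without hurting reproducibility.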
@michaelwoerister:
How about if we introduce a variable __RUSTC_CWD to the mapping syntax?
After going to the effort of adding the -from/-to variants to the command-line syntax to avoid needing to parse the string for a separator character, I don't think adding in metasyntactic variables is very consistent with that.
But more generally, I think there's a somewhat irreconcilable problem: sometimes absolute paths are the right thing to use, and sometimes relative is, on a purely case-by-case basis.
(My background summary for my own benefit)
For code that is compiled directly to a pure object file, the answer is pretty clear: each object has a corresponding set of files with a DW_AT_comp_dir and source paths relative to that. Tools can try to construct their own abspath by combining them, but also construct relative paths purely from the source names, or by having their own source_path to apply to each filename.
The problem arises from code that isn't completely generated into object code at its own build time, but defers it to later specialization in the context of some other module.
C/C++ has no problem with this, because this is always performed in source terms; the second compilation always needs to refer to the source of the first compilation, so we know they're at least being compiled in the same namespace, and the DW_AT_comp_dir+relative path will resolve to something meaningful.
In Rust it's trickier because there's always a compilation to a form of object file (either .rlib or .so), so the original sources are never needed, and the second compilation could be in a completely different filesystem namespace, making the concept of "path to source" for the first compilation potentially meaningless. Or they could be in the same namespace, but with no meaningful relative relationship. Or they could be in the same source tree, but the absolute position of that tree might be different from build to build (or between builder and editor).
So, given that, the questions that occur to me are:
Does Dwarf have a way of expressing what we want here?
"comp_dir=.../libfoo src path ../src/thing.rs inlined and specialized into comp_dir=.../libbar src path some/other/path.rs"
How does C/C++ handle this with
Hm, on closer inspection, it looks to me like DW_AT_comp_dir and DW_AT_name for the source are a red herring; rustc only seems to generate an entry for the top-level lib.rs, and the rest of the sources don't appear there.
The real action is happening in the "Directory Table" (in readelf output, include_directories in the DWARF spec) and the "File Name Table" (file_names). It seems that rustc always generates bare names in the file name table (lib.rs), and generates a new directory entry for each path within the crate, as full paths (/my/source/is/here/libfoo/src/submodule).
The DWARF spec says the file names are relative to either DW_AT_comp_dir or a specific entry in the directory table.
So, for example, libfutures.rlib has:
The Directory Table (offset 0x1b):
1 /my/full/path/to/futures/src
2 /my/full/path/to/futures/src/future
3 /my/full/path/to/futures/src/stream
4 /my/full/path/to/futures/src/sink
5 /my/full/path/to/futures/src/task_impl
6 /my/full/path/to/futures/src/sync
7 /my/full/path/to/futures/src/sync/mpsc
[...]
The File Name Table (offset 0x5f1):
Entry Dir Time Size Name
1 1 0 0 lib.rs
2 2 0 0 mod.rs
3 2 0 0 lazy.rs
4 0 0 0 <std macros>
5 2 0 0 flatten_stream.rs
6 2 0 0 join.rs
7 2 0 0 select.rs
8 2 0 0 chain.rs
9 2 0 0 join_all.rs
10 2 0 0 select_all.rs
11 2 0 0 select_ok.rs
[...]
This means that lib.rs is relative to /my/full/path/to/futures/src, etc. I think rustc is wrong here. It should only have one path to this crate's sources, and then each name in the filenames list should be relative to that. For example:
Assuming DW_AT_comp_dir is /my/full/path/to/futures/src:
The Directory Table (offset 0x1b):
1 /my/full/path/to/futures/src
[...]
The File Name Table (offset 0x5f1):
Entry Dir Time Size Name
1 1 0 0 lib.rs
2 1 0 0 future/mod.rs
3 1 0 0 future/lazy.rs
4 0 0 0 <std macros>
5 1 0 0 future/flatten_stream.rs
6 1 0 0 future/join.rs
7 1 0 0 future/select.rs
8 1 0 0 future/chain.rs
9 1 0 0 future/join_all.rs
10 1 0 0 future/select_all.rs
11 1 0 0 future/select_ok.rs
[...]
so that all the names are also sensible relative to DW_AT_comp_dir as well as having an absolute path. (This is post remapping if the actual build happened in a separate build dir.)
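To illustrate that both layouts name the same files, here is a small Python sketch of how a consumer might resolve file-table entries. It assumes the DWARF convention that directory index 0 means the compilation directory and that other indices are 1-based into the directory table. The resolve function is illustrative only, and only the first few entries from the tables above are reproduced.

```python
import posixpath

def resolve(entry, directories, comp_dir):
    """Resolve one file-table entry (dir_index, name) to a full path."""
    dir_index, name = entry
    if dir_index == 0:
        base = comp_dir
    else:
        # join() leaves an absolute directory entry unchanged and anchors a
        # relative one at comp_dir.
        base = posixpath.join(comp_dir, directories[dir_index - 1])
    return posixpath.normpath(posixpath.join(base, name))

comp_dir = "/my/full/path/to/futures/src"

# What rustc emits today: one directory entry per source directory.
current_dirs = [comp_dir, comp_dir + "/future"]
current_files = [(1, "lib.rs"), (2, "mod.rs"), (2, "lazy.rs")]

# Proposed: a single directory entry, names relative to it.
proposed_dirs = [comp_dir]
proposed_files = [(1, "lib.rs"), (1, "future/mod.rs"), (1, "future/lazy.rs")]

current = [resolve(f, current_dirs, comp_dir) for f in current_files]
proposed = [resolve(f, proposed_dirs, comp_dir) for f in proposed_files]
print(current == proposed)  # True
```

The difference is that in the proposed layout only one absolute string needs remapping, and the file names stay meaningful when read relative to DW_AT_comp_dir.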
What's more, code from other modules can use different directory entries:
The Directory Table (offset 0x1b):
[...]
8 /my/path/to/rust/1.14/src/rust/src
[...]
The File Name Table (offset 0x5f1):
Entry Dir Time Size Name
[...]
41 8 0 0 liballoc/arc.rs
[...]
So I think this goes back to @michaelwoerister's comment above about how rustc should construct the pathnames to emit based on the top-level source passed to rustc, and make sure that it propagates that all the way through to the DWARF info without rewriting them as absolute dir + filename.
@jsgf The other files appear in DW_AT_decl_file attributes. This is how DWARF encodes paths. Paths in the file name table are relative to their directory table entry, paths in the directory table are relative to the DW_AT_comp_dir of the compilation unit. It's done in a way that maps well to C include directories but there is no clear rule how this should be used (which doesn't mean that the way rustc does it now is a good one).
@michaelwoerister DW_AT_decl_file gives the file index for a piece of the code, but the ultimate filename comes from the file name and directory tables.
@infinity0 In your example from above, when compiling libbar, I don't think the compiler would be able to know that the source of libfoo could be found under /my/crates/libfoo. It only knows where libfoo.rlib is, which is independent of libfoo's source location.
DW_AT_decl_file gives the file index for a piece of the code, but the ultimate filename comes from the file name and directory tables.
Yes, sorry, that didn't make too much sense. The important thing is that the file-name and directory tables are just an arbitrary encoding. It seems that GCC for example does the same as rustc (maybe because it takes the least space).
@michaelwoerister I see, OK. This situation is indeed not handled well by DWARF. But I think it's possible to make it work, going back to the crate_dir concept. Instead of recording paths to libfoo relative to libbar, you could record these paths as a special dummy path like crate://libfoo/$relative_path, where $relative_path is as I suggested above, calculated using crate_dir and other information in libfoo.rlib.
I think this sort of solution is inevitable given the differences from C/C++ that @jsgf just talked about in detail. If I build libbar against libfoo.rlib without necessarily having the latter's source code, but I want my users to be able to get that source code to debug my library, then the only theoretically possible thing to do is to refer to a relative path, one that is relative to some "abstract idea" of where the source code of libfoo might be.
Anyway, this is going slightly away from the topic of prefix maps. To summarise, I think we should separate the concerns here:
to support debugging source code from other libraries libfoo (whose source code might not be available at build time, but debugging symbols are), this can be achieved by recording crate_dir and using virtual paths (e.g. crate://libfoo/path/to/file/containing/a/generic/symbol)
to support reproducible builds, this can be achieved by using prefix-maps to transform paths as usual, e.g. crate_dir, comp_dir, name, those things in the Directory Table, etc.
Actually, if it's possible/suitable for rustc to auto-detect the top-level crate_dir of any given file that it's compiling, these prefix-maps aren't even necessary at all - all paths can simply be made relative to this directory, and debuggers can recreate these paths on their side no problem. Prefix-maps are only really needed when the build tool does not know the "top-level" directory of the source code and needs a parent buildsystem to pass it in, which is the case for traditional C compilers but not usually the case for more modern languages.
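As a sketch of how such virtual paths might be produced and consumed: the crate:// scheme is the one proposed above, but the function names and the source_roots lookup table are invented for illustration and do not correspond to any existing rustc or debugger feature.

```python
import posixpath

SCHEME = "crate://"

def to_virtual(crate_name, crate_dir, source_path):
    """Build side: replace the crate's top-level directory with a virtual root."""
    rel = posixpath.relpath(source_path, crate_dir)
    return SCHEME + crate_name + "/" + rel

def resolve_virtual(virtual_path, source_roots):
    """Debugger side: map a virtual path onto wherever the sources live locally."""
    if not virtual_path.startswith(SCHEME):
        return virtual_path  # ordinary path, nothing to do
    crate_name, _, rel = virtual_path[len(SCHEME):].partition("/")
    root = source_roots.get(crate_name)
    if root is None:
        return None  # sources for this crate are not available locally
    return posixpath.join(root, rel)

v = to_virtual("libfoo", "/path/to/libfoo", "/path/to/libfoo/src/lib.rs")
print(v)  # crate://libfoo/src/lib.rs
print(resolve_virtual(v, {"libfoo": "/my/crates/libfoo"}))  # /my/crates/libfoo/src/lib.rs
```

Nothing build-directory-specific survives in the recorded string, so the same scheme would serve both the debugging and the reproducibility concern.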
Perhaps I missed this, but is there a reason that we can't remap the paths when they are being stored into an rlib, to avoid the need to map them later on? ie: do the remap at the point where the source location is known.
By doing this, non-cargo builders have the control they need, and we'd just need some higher level support for path-remapping in cargo to handle its multiple rustc invocations.
@infinity0 Your idea of "abstract source" matches up with what I had in mind as the primary motivation for prefix mapping and can also be implemented via regular prefix mapping if we do what @jmesmon suggests:
When you invoke the compiler with something like rustc ../src/lib.rs, you can have a mapping ../src/ -> [email protected]/ (if you are using relative paths) or /absolute/path/to/your/src/ -> [email protected]/ where [email protected] is a kind of abstract name for the given source code. In a later step the consumer (like GDB) can have its own mapping for that or you remap it again with debugedit. For this to work, we need to do what @jmesmon says: remap paths as they are stored or -- equivalently -- store the mapping with the crate and apply it to anything that comes from that crate. This way, things coming from other crates potentially are already in the known "abstract source space" and you know what to do with them.
Below is a description of the new mapping algorithm I plan to implement. I think it is a synthesis of the insights that emerged from the discussion above and I also think that it meets the different requirements. Please let me know if you have any feedback!
fn map_path_for(path, def_id)
{
    // Always use the mapping that was defined for the given crate. A DefId
    // knows which crate it's coming from.
    let mapping = get_mapping_for(def_id);
    let (mapped_path, was_affected_by_mapping) = mapping.map(path);

    if def_id.is_from_local_crate()
    {
        // For the local crate we are done because all inputs are relative to
        // the working directory of the current compiler invocation
        return mapped_path
    }

    if path.is_absolute()
    {
        // For absolute paths there's nothing more to do since those
        // don't depend on the working directory of any compiler invocation
        return mapped_path
    }

    // At this point we know that we have the path for something that has
    // been inlined into the current crate and it has a path that is relative
    // to the compiler's working directory when the upstream crate was
    // compiled.
    // We need to find out if the path that the debugger is going to look up
    // was in any way affected by path-remapping:
    // - If yes, we assume that this was intentional and don't mess with it.
    // - If no, we need to make the path absolute so that debuggers can still
    //   find it.
    if was_affected_by_mapping || cgu_working_dir_is_affected_by_mapping(def_id)
    {
        // Assume the user knows what they are doing
        return mapped_path
    }
    else
    {
        // Make absolute so that the debugger can find the source file
        return get_cgu_working_dir_for(def_id).join(mapped_path)
    }
}
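For anyone who wants to experiment with the rules, here is a rough executable rendering of the pseudocode in Python. It is a sketch under stated assumptions, not the planned implementation: the DefId lookups are replaced by explicit arguments, apply_mapping is a hypothetical helper, and the mapping is modelled as an ordered list of (from, to) string prefixes where the first match wins.

```python
import posixpath

def apply_mapping(path, mapping):
    """Replace the first matching prefix; return (new_path, was_affected)."""
    for old, new in mapping:
        if path.startswith(old):
            return new + path[len(old):], True
    return path, False

def map_path_for(path, mapping, is_local, working_dir):
    """mapping and working_dir belong to the crate that defines the item."""
    mapped_path, was_affected = apply_mapping(path, mapping)
    # Local items and absolute paths need no further treatment.
    if is_local or posixpath.isabs(path):
        return mapped_path
    # Relative path from an upstream crate: leave it alone if the user
    # remapped either the path itself or that crate's working directory.
    _, working_dir_affected = apply_mapping(working_dir, mapping)
    if was_affected or working_dir_affected:
        return mapped_path
    # Otherwise make it absolute so the debugger can still find the file.
    return posixpath.join(working_dir, mapped_path)

# An item from crate A inlined into C, with A compiled in /crates/A/build.
a_dir = "/crates/A/build"
print(map_path_for("../src/bar.rs", [("../src", "/mapped-src-of-A")], False, a_dir))
# /mapped-src-of-A/bar.rs
print(map_path_for("../src/bar.rs", [], False, a_dir))
# /crates/A/build/../src/bar.rs
```

The second call shows the fallback branch: with no mapping at all, the upstream working directory is prepended without normalising the result, mirroring the join in the pseudocode.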
A   B
 \ /
  C    ... C depends on A and B
CRATE A
working dir = /crates/A/build
src = /crates/A/src/lib.rs
CRATE B
working dir = /crates/B/build
src = /crates/B/src/lib.rs
CRATE C
working dir = /crates/C/build
src = /crates/C/src/lib.rs
What the compiler sees:
CRATE A
working dir = /crates/A/build
src = ../src/lib.rs
CRATE B
working dir = /crates/B/build
src = ../src/lib.rs
CRATE C
working dir = /crates/C/build
src = ../src/lib.rs
Example 1, build directory independent configuration:
Mapping for A: ../src => /mapped-src-of-A
Mapping for B: ../src => /mapped-src-of-B
Mapping for C: ../src => /mapped-src-of-C
RESULTS:
working directory = /crates/C/build
function in C, defined in /crates/C/src/foo.rs => /mapped-src-of-C/foo.rs
function in C, imported from A, defined in /crates/A/src/bar.rs => /mapped-src-of-A/bar.rs
function in C, imported from B, defined in /crates/B/src/baz.rs => /mapped-src-of-B/baz.rs
Example 2, map all crates into "common source space":
Mapping for A: /crates/A/build => common-src-root
../src => A/src
Mapping for B: /crates/B/build => common-src-root
../src => B/src
Mapping for C: /crates/C/build => common-src-root
../src => C/src
RESULTS:
working directory = common-src-root
function in C, defined in /crates/C/src/foo.rs => C/src/foo.rs
function in C, imported from A, defined in /crates/A/src/bar.rs => A/src/bar.rs
function in C, imported from B, defined in /crates/B/src/baz.rs => B/src/baz.rs
What the compiler sees:
CRATE A
working dir = /crates/A/build
src = /crates/A/src/lib.rs
CRATE B
working dir = /crates/B/build
src = /crates/B/src/lib.rs
CRATE C
working dir = /crates/C/build
src = /crates/C/src/lib.rs
Example 1, build directory independent configuration:
Mapping for A: /crates/A/src => /mapped-src-of-A
Mapping for B: /crates/B/src => /mapped-src-of-B
Mapping for C: /crates/C/src => /mapped-src-of-C
RESULTS:
working directory = /crates/C/build
Function in C, defined in /crates/C/src/foo.rs => /mapped-src-of-C/foo.rs
Function in C, imported from A, defined in /crates/A/src/bar.rs => /mapped-src-of-A/bar.rs
Function in C, imported from B, defined in /crates/B/src/baz.rs => /mapped-src-of-B/baz.rs
Example 2, map all crates into "common source space":
Mapping for A: /crates/A/src => common-src-root/A/src
Mapping for B: /crates/B/src => common-src-root/B/src
Mapping for C: /crates/C/src => common-src-root/C/src
RESULTS:
working directory = /crates/C/build
function in C, defined in /crates/C/src/foo.rs => common-src-root/C/src/foo.rs
function in C, imported from A, defined in /crates/A/src/bar.rs => common-src-root/A/src/bar.rs
function in C, imported from B, defined in /crates/B/src/baz.rs => common-src-root/B/src/baz.rs
Again, please let me know if you have any feedback!
It looks like pseudo code should have return path replaced with return mapped_path? Otherwise no mapping is going to take place :)
On making the paths for inlined functions absolute: is that something that gcc/clang/other compilers do? It seems like relative paths could still be correct there (i.e. relpaths can work in the case where the whole source+binary are moved together, while abspaths can work when the source is not moved but the binary is). This seems minor though.
Otherwise, this sounds like it's at the point where we just need to try it out 👍
It looks like pseudo code should have return path replaced with return mapped_path? Otherwise no mapping is going to take place :)
You might be on to something there :) I fixed the code snippet, thanks!
On making the paths for inlined functions absolute: is that something that gcc/clang/other compilers do?
No, this is something that is specific to Rust. In C/C++ each compiler invocation has the correct relative paths available because of header files. In Rust we don't have header files and don't have a connection between an rlib and the source it was compiled from.
I'd imagine that lto in gcc/clang could mirror (the path mapping issues with) rust's function inlining as in the lto case the functions to be inlined (if the compiler chooses to do inlining) would be from another translation unit.
I'm still digesting the above, but it looks good so far.
Wishlist request: an option to make paths/filenames in compiler messages the remapped version, rather than the input version.
Triage: there's been a ton of discussion on this issue, but I have no idea what the current state of it is. can anyone summarize?
We are using --remap-path-prefix in Debian and it works to make ripgrep reproducible.
rustc itself is still not reproducible however, and I don't know why. That is tracked in #34902. There have been quite a few regressions.
Yes, it's available on stable so I'm closing this issue.