cc @michaelwoerister

jmesmon on 12 Dec 2016

Thanks for writing up the issue, @jmesmon!

@rust-lang/tools: Can any of you think of a reason not to add something like this as a -C flag?

michaelwoerister on 12 Dec 2016

Sounds good to me!

alexcrichton on 12 Dec 2016

A -C flag is good, but sometimes we have seen that certain buildsystems or individual projects like to save compiler flags (i.e. -fdebug-prefix-map=<PATH>=.) into other parts of the overall build output, thereby making it depend on the build path, even if rustc itself supports this -C flag. But it's reasonable to assume that compiler flags will affect the output, and save this somewhere else for auditing.

Because of this, we recently submitted some patches to GCC to also support the same behaviour as debug-prefix-map, except via an environment variable that explicitly should not be saved to any build output. The patches are still pending, but I haven't yet received any significant negative comments on it, and I'll be pinging GCC again soon about it. It would be good if rustc could also support this same environment variable in the future.

Actually, I didn't know about debugedit before, thanks for bringing that up! It might allow us to normalise files outside of GCC or independently of any other compiler, I will have to look into that.

infinity0 on 12 Dec 2016

I implemented this and it seems to work. I tested by building src/test/run-make/reproducible-build in two different directories. With -C debug-prefix-map=`pwd`=. the output is reproducible; without it, it isn't.

(reproducible-build test itself only tests stable symbol naming, not bit-for-bit output. In the past Rust could produce different symbol names between runs(!): see #30330.)

sanxiyn on 13 Dec 2016

Unclear if it's important for rust, but in gcc/clang multiple mappings are supported. This ends up being important in C/C++ due to #include pulling in files from different directories. If something similar can happen in rust allowing multiple maps would be useful there too.

Also, it could be a good idea to avoid splitting on = altogether, rather than copying gcc's interface, so that = can be included in the old path (though including = in a path is unlikely, it'd be a good idea to avoid leaving behind that landmine if possible).

jmesmon on 13 Dec 2016

How do multiple mappings work? Are they applied in command line order? Then is the order significant? I think it must be, since a/b=c a=d applied to a/b results in c, but a=d a/b=c applied to a/b results in d/b.

sanxiyn on 13 Dec 2016

In gcc, the handling of multiple debug_prefix_maps is to search the mappings last-on-cmdline to first-on-cmdline, applying the first prefix that matches. The last-to-first strategy is common in gcc (and other command line tools) as it is intended that later options are able to override earlier ones.
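To make the ordering question concrete, here is a minimal sketch of last-to-first prefix matching on plain strings. The function name `remap_prefix` is hypothetical and the code only models the search order described above; it is not gcc's or rustc's implementation.

```rust
/// Hypothetical helper: remap `path` using string prefixes.
/// `maps` is given in command-line order; the search runs last to first
/// and stops at the first prefix that matches.
fn remap_prefix(path: &str, maps: &[(&str, &str)]) -> String {
    for (old, new) in maps.iter().rev() {
        if let Some(rest) = path.strip_prefix(*old) {
            return format!("{}{}", new, rest);
        }
    }
    // No mapping matched: the path is left as-is.
    path.to_string()
}
```

With the mappings from the question above, `a/b=c a=d` applied to `a/b` gives `d/b` under this rule (the later `a=d` is tried first), while `a=d a/b=c` gives `c`. So the order is significant, and it is the reverse of what a first-to-last search would produce.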

jmesmon on 13 Dec 2016

My GCC patches linked above include modifying the existing GCC behaviour to split on the final = instead of the initial one. I think that is better as well: I can imagine someone wanting to map a path that contains a =, but it is less likely that someone wants to map something to such a path.

(edit: previously mentioned a space character, that was for some other thing that I got confused with)
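The difference between splitting on the first and the last = can be sketched in a few lines. The function names are hypothetical; `str::split_once` and `str::rsplit_once` are standard library methods.

```rust
/// Split `OLD=NEW` on the first `=`: OLD cannot contain `=`.
fn split_on_first(arg: &str) -> Option<(&str, &str)> {
    arg.split_once('=')
}

/// Split `OLD=NEW` on the last `=`: OLD may contain `=`, NEW may not.
fn split_on_last(arg: &str) -> Option<(&str, &str)> {
    arg.rsplit_once('=')
}
```

For an argument such as `/build/a=b/src=/usr/src`, splitting on the last = recovers `/build/a=b/src` as the old path, whereas splitting on the first = cuts the old path short at `/build/a`.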

infinity0 on 13 Dec 2016

Thanks for answers! I will implement multiple mappings, last-to-first order, and splitting on the final = now.

sanxiyn on 14 Dec 2016

I'd recommend avoiding splitting on = entirely. Is the cmdline interface of rustc flexible enough to handle either having an argument take 2 parameters or allow 2 flags to work together to implement the same thing? (for example, -C debug_prefix_old=foo -C debug_prefix_new=bar and enforcing ordering + pairing).

It would be a really good idea to keep all of the escaping/special characters in callers of rustc (shells, etc) just to avoid funny limitations like this (= being special).

jmesmon on 14 Dec 2016

How about requiring the mapping information to be provided in a file (similar to ld's --version-script for example)? Would that be too clunky?
But it seems to me that this is a feature that's only used in specialized settings, so that would seem fine to me.

michaelwoerister on 14 Dec 2016

I'd prefer avoiding needing to use (temporary?) external files to configure this feature. I say temporary because debug src mappings aren't something like a target specification, where there is a fixed, predetermined value for all platforms: they depend on the build directory and on where the source is being mapped to (which, in the debug source packaging case, is typically a path under /usr/src/package-name-version, where the version is potentially adjusted quite a bit).

And one would need to know the escaping in that file format, so it doesn't simplify things wrt allowing arbitrary paths, it just moves them somewhere else.

jmesmon on 14 Dec 2016

So, it seems that discussion on this issue has stalled (here and over in #38348) for two reasons:

It's not clear what the command interface should look like. Passing a map of paths without introducing additional escaping rules is harder than it sounds.
The initial implementation in #38348 revealed that the semantics of remapping are more complicated than it might seem at first. Is remapping based on strings or logical paths? E.g. if I replace /abc/def with xyz, what happens when I encounter /abc/./def or /abc/../abc/def/? What about relative paths?

I want to move forward with this so I propose the following solutions:

If no one has a clear, practical reason otherwise, I say we use the CLI as proposed by @infinity0 and myself: have pairs of -Zdebug-prefix-map-from=<...> and -Zdebug-prefix-map-to=<...>, that are matched up nth from to nth to. It is a bit verbose but doesn't require any additional escaping and can handle paths on all platforms.
The semantics are a bit more complicated but I propose that debuginfo paths are generally normalized to not contain . or .. components and that remapping works on absolute versions of these normalized paths. This gives predictable results. UPDATE: Prefix matching works at directory name level, not at the path-string content (see example below).
Some examples:

map: /abc/def -> /xyz

Absolute paths containing the prefix:
/abc/def/file1.rs -> /xyz/file1.rs
/abc/def/build/../file1.rs -> /xyz/file1.rs
/abc/def/./file1.rs -> /xyz/file1.rs
/abc/./def/file1.rs -> /xyz/file1.rs  // would not match with gcc
/abc/def/mod1/file1.rs -> /xyz/mod1/file1.rs 
/abc/def/mod1/./file1.rs -> /xyz/mod1/file1.rs 

Absolute paths not containing the prefix:
/std/file1.rs -> /std/file1.rs // no change
/std/./file1.rs -> /std/file1.rs // normalization
/std/build/../file1.rs -> /std/file1.rs // normalization

Relative paths containing the prefix:
(DW_AT_comp_dir=/abc/def/build, path=../file1.rs) => /xyz/file1.rs
(DW_AT_comp_dir=/abc, path=./def/file1.rs) => /xyz/file1.rs  // would not match with gcc
(DW_AT_comp_dir=/std, path=./mod1/file1.rs) => /std/mod1/file1.rs

Mapping happens at the directory name level, no partial names allowed:
/abc/def-2/file1.rs -> /abc/def-2/file1.rs
/abc/def.rs -> /abc/def.rs

The formula that produces these results is:

fn debuginfo_path(p: &Path, map: &[(PathBuf, PathBuf)]) -> PathBuf {
    let p = normalize(make_absolute(p));
    // Exit on the *first* match.
    // UPDATE: option order is last to first, i.e. later CLI options overrule earlier ones
    for (from, to) in map.iter().rev() {
        if p.starts_with(from) {
            return to.join(p.strip_prefix(from).unwrap());
        }
    }
    // No remapping done, but still normalized and absolute now
    p
}
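
The formula above leaves `normalize` and `make_absolute` undefined. The following is a minimal runnable sketch of the whole proposal, under these assumptions: normalization is purely lexical (no filesystem access), relative paths are joined onto a `comp_dir` argument that stands in for DW_AT_comp_dir, and matching is done on path components via `Path::strip_prefix`. The helper bodies are illustrative, not what rustc does.

```rust
use std::path::{Component, Path, PathBuf};

/// Assumed helper: join a relative path onto `comp_dir`.
/// `join` already returns `p` unchanged when `p` is absolute.
fn make_absolute(comp_dir: &Path, p: &Path) -> PathBuf {
    comp_dir.join(p)
}

/// Assumed helper: lexically drop `.` and resolve `..` components.
fn normalize(p: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for c in p.components() {
        match c {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

fn debuginfo_path(comp_dir: &Path, p: &Path, map: &[(PathBuf, PathBuf)]) -> PathBuf {
    let p = normalize(&make_absolute(comp_dir, p));
    // Search last to first, so later CLI options overrule earlier ones.
    for (from, to) in map.iter().rev() {
        // `strip_prefix` compares whole components, so `/abc/def`
        // does not match `/abc/def-2`.
        if let Ok(rest) = p.strip_prefix(from) {
            return to.join(rest);
        }
    }
    // No remapping done, but still normalized and absolute.
    p
}
```

With the single mapping `/abc/def -> /xyz`, this sketch maps `/abc/def/build/../file1.rs` and `/abc/./def/file1.rs` to `/xyz/file1.rs`, and leaves `/abc/def-2/file1.rs` unchanged, in line with the examples listed before the formula.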

Note that whether paths are later stored as relative to their DW_AT_comp_dir again is an independent question that I don't want to discuss here.

Thoughts? @jmesmon @infinity0 @sanxiyn @jsgf @rust-lang/tools

michaelwoerister on 18 Jan 2017

I don't see an issue with that so long as the "commandline option order" is defined as last-to-first. Doing this bit differently from other command line utilities doesn't buy us anything (unlike having 2 separate args for from & to).

It also looks like the examples decide to fix the mapping at directory name level, but that doesn't appear in the English description. I don't have anything in mind that would break that, but given that allowing matching partials would allow appending '/' to the end to get matching-full-path-elements, I'm not sure such a restriction is a good idea.

jmesmon on 18 Jan 2017

so long as the "commandline option order" is defined as last-to-first

If there's precedent for that I'm fine with going last-to-first.

michaelwoerister on 18 Jan 2017

I don't have anything in mind that would break that, but given that allowing matching partials would allow appending '/' to the end to get matching-full-path-elements, I'm not sure such a restriction is a good idea.

I think it's just simpler to use. You don't have to worry if you need to append a / to avoid accidental renamings.

michaelwoerister on 18 Jan 2017

If there's precedent for that I'm fine with going last-to-first.

This is the order used by gcc & clang for all of their "more than 1 & pick 1" options (ignoring special cases): debug-maps (as discussed earlier in this thread), optimization levels, debug info levels, include directories, etc.

jmesmon on 18 Jan 2017

I think it's just simpler to use. You don't have to worry if you need to append a / to avoid accidental renamings.

Sure, it's simpler. The issue is that it's also less flexible, and it's trivial to get the match-full-paths behavior from match-anything, but going the other direction is impossible. Again, I don't have a use case that would want partial matches, but the ease of supporting both should be considered.

jmesmon on 18 Jan 2017

This is the order used by gcc & clang

I updated the description above to reflect this.

michaelwoerister on 18 Jan 2017

Sure, it's simpler. The issue is that it's also less flexible, and it's trivial to get the match-full-paths behavior from match-anything, but going the other direction is impossible.

Yes, I know that string-based matching is more powerful. My argument is that it gives you more subtle ways to get it wrong without a clear benefit. Is it from=/abc/ to=/xyz/? from=/abc/ to=/xyz? from=/abc to=/xyz? But I don't have a strong preference. If someone says they absolutely want this, I'm fine with implementing it.
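
The two matching styles being weighed here can be compared directly. This is a small sketch with hypothetical function names: the first matches on raw strings (the gcc style), the second on whole path components using `Path::starts_with` from the standard library.

```rust
use std::path::Path;

/// String-level match: any textual prefix counts.
fn matches_string(path: &str, from: &str) -> bool {
    path.starts_with(from)
}

/// Component-level match: `from` must end on a directory boundary.
fn matches_components(path: &str, from: &str) -> bool {
    Path::new(path).starts_with(from)
}
```

With `from=/abc/def`, the string match also catches `/abc/def-2/file1.rs`, while the component match does not. Writing `from=/abc/def/` with a trailing slash makes the string match behave like the component match for this case, which is the "append a /" trick mentioned above.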

michaelwoerister on 18 Jan 2017

Several points:

I think splitting the option into two is very unpretty, and somewhat ambiguous - if they're just related by being adjacent, it seems like it raises a lot of questions:

What if from and to get separated by other options?
What if there isn't the same number of from and to options?
Can the from and to be reordered (ie, what if they appear as to from)?

In particular it means that any tool that's parsing/processing the commandline needs to know about this special case in order to avoid breaking it.

Normalizing paths by eliminating .. is dangerous if the path contains symlinks: foo/bar/../blat/lib.rs is not the same as foo/blat/lib.rs if bar is a symlink. I'm happy with matching at the directory component level (though string matching is strictly more general), but I think going beyond that is a bad idea.

Edit: I can't think of a problem with eliminating . though. Perhaps that would be useful.
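
The symlink hazard can be reproduced with a small experiment. This sketch is Unix-only and builds a throwaway directory layout under a caller-supplied `base` directory; the layout, the names `other/deep` and `other/blat`, and the function `symlink_demo` are all made up for illustration.

```rust
use std::fs;
use std::io;
use std::path::{Component, Path, PathBuf};

/// Lexical normalization: drops `.` and resolves `..` without
/// consulting the filesystem.
fn normalize(p: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for c in p.components() {
        match c {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Creates `foo/bar` as a symlink to `other/deep` under `base` (which
/// must exist) and returns (lexical result, real result) for
/// `foo/bar/../blat/lib.rs`.
#[cfg(unix)]
fn symlink_demo(base: &Path) -> io::Result<(PathBuf, PathBuf)> {
    let base = fs::canonicalize(base)?;
    fs::create_dir_all(base.join("other/deep"))?;
    fs::create_dir_all(base.join("other/blat"))?;
    fs::create_dir_all(base.join("foo"))?;
    fs::write(base.join("other/blat/lib.rs"), "")?;
    std::os::unix::fs::symlink(base.join("other/deep"), base.join("foo/bar"))?;
    let input = base.join("foo/bar/../blat/lib.rs");
    // The kernel follows the symlink before applying `..`;
    // the lexical version does not.
    Ok((normalize(&input), fs::canonicalize(&input)?))
}
```

In this layout the lexical result points at `foo/blat/lib.rs`, which does not exist, while the filesystem resolves the same input to `other/blat/lib.rs`.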

Proposal:

Retain the -Zdebug-prefix-map=OLD=NEW syntax, but also add -Zdebug-prefix-map-separator=X such that subsequent (left to right on the command line) OLD and NEW mappings can be separated by X. This allows the mappings to contain any character (but not every character), and the tool generating the command line can select a separator that doesn't break the path.

jsgf on 18 Jan 2017

I think @michaelwoerister's suggestions are cleanest; tools that want to add this value to a rustc flag probably don't want to look inside the value to search it and then select a separator.

I agree that matching only full path components is best and less prone to error. I am a little concerned that GCC differs a bit from this, but the rustc code example is indeed very simple and it might be possible to ask GCC to adopt a similar approach - I have to send them a patch anyways, I may add this as well. Even if they don't adopt it, we (in the interests of standardising this behaviour) could define a standard that defines a "minimal" mapping behaviour but leave it open saying "the tool might perform further additional mappings".

I am less sure about normalisation, because it has the potential to mess with the semantics of various fields. For example this:

(DW_AT_comp_dir=/abc, path=./def/file.rs) => /xyz/file1.rs // would not match with gcc

what would you map DW_AT_comp_dir to? / or .? It's unclear, and it messes with the semantics, which is supposed to be "the working directory of the build command". There could be other fields that depend on the original meaning.

There are two cases:

We have a mapping for /abc, in which case both fields are reproducible, no need to normalise.
We don't have a mapping for /abc, in which case I'd say that there is an issue (could even call it a bug) with the tool that sets these mapping flags - it knowingly set the PWD but gave a mapping only for a specific child of the PWD. In this case, I think it's better to leave the situation unaltered so that it's at least detectable by reproducers, rather than trying to do something fancy.

(edit: "other mappings" -> "further additional mappings")

infinity0 on 18 Jan 2017

(To expand on the flags points, I think it's fine to allow from and to to be separated by other options and appear in a non-pairwise order, and in practise no build tool would actually do this but it makes the logic simpler to analyse and write code for; and if they have different numbers of arguments then just fail the compile with an error.)
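
The pairing rule described here (n-th from matched with n-th to, other options ignored, mismatched counts rejected) can be sketched as follows. The flag names are the ones proposed in this thread, not options that rustc is known to accept, and `pair_mappings` is a hypothetical function.

```rust
/// Pair the n-th `from` with the n-th `to`, regardless of how the two
/// kinds of option are interleaved with each other or with other options.
fn pair_mappings(args: &[&str]) -> Result<Vec<(String, String)>, String> {
    // Flag spellings as proposed in this thread (assumption, not a real CLI).
    const FROM: &str = "-Zdebug-prefix-map-from=";
    const TO: &str = "-Zdebug-prefix-map-to=";
    let mut from = Vec::new();
    let mut to = Vec::new();
    for arg in args {
        if let Some(v) = arg.strip_prefix(FROM) {
            from.push(v.to_string());
        } else if let Some(v) = arg.strip_prefix(TO) {
            to.push(v.to_string());
        }
    }
    if from.len() != to.len() {
        // Different numbers of `from` and `to`: fail instead of guessing.
        return Err(format!(
            "{} from option(s) but {} to option(s)",
            from.len(),
            to.len()
        ));
    }
    Ok(from.into_iter().zip(to).collect())
}
```

Because the two lists are collected independently, a `to` that appears before its `from`, or with unrelated options in between, pairs up the same way.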

infinity0 on 18 Jan 2017

@jsgf I agree with @infinity0 regarding the CLI options:

What if from and to get separated by other options?

It makes no difference to the semantics.

What if there isn't the same number of from and to options?

Compilation fails with an error.

Can the from and to be reordered (ie, what if they appear as to from)?

It makes no difference. The n-th to is matched up with the n-th from.

Normalizing paths by eliminating .. is dangerous if the path contains symlinks [...]

You're right, that's a problem. An alternative would be to replace normalize(make_absolute(p)) with std::fs::canonicalize(path), which also resolves symlinks. You'd have to know where your symlinks lead to be able to write mappings but I think it's still easier to reason about than what we have today, where things are unpredictable to a large degree.

michaelwoerister on 18 Jan 2017

@infinity0

I am less sure about normalisation, because it has the potential to mess with the semantics of various fields. For example this:

(DW_AT_comp_dir=/abc, path=./def/file.rs) => /xyz/file1.rs // would not match with gcc

what would you map DW_AT_comp_dir to? / or .? It's unclear, and it messes with the semantics, which is supposed to be "the working directory of the build command". There could be other fields that depend on the original meaning.

DW_AT_comp_dir would stay unchanged as /abc and the path being mapped in this example would be absolute and independent of the corresponding DW_AT_comp_dir. Having a mapping like this would probably be a bug in the build system. The output would be predictable though.

michaelwoerister on 18 Jan 2017

Canonicalizing the path might make things worse. If you're using symlinks to normalize the namespace across multiple machines (a distributed build farm, for example), then canonicalizing the path will denormalize it. I think it would be a mistake to be too clever with the paths.

jsgf on 18 Jan 2017

If you're using symlinks to normalize the namespace across multiple machines (a distributed build farm, for example), then canonicalizing the path will denormalize it.

Yes, that makes sense. No canonicalizing then.

michaelwoerister on 18 Jan 2017

Thinking about this more, I really think this is getting vastly overcomplicated.

I think the debug prefix map should be considered purely an operation on strings: the literal strings passed on the command-line to refer to the input sources (and - I guess - the current directory either from $PWD or getcwd()) with no attempt at processing, normalizing or even taking path elements into account.

This implies that the remapping should be applied before rustc makes them absolute, and rustc should only make them absolute if the output of the remapping is not already absolute (since remapping ./foo/ to /an/absolute/path should be perfectly reasonable).

I think that's a much simpler model to reason about and use. If I think about how I want to use this feature, any path processing on rustc's part makes things more complex to reason about and doesn't solve any of my problems.

Edit: By which I mean, any path processing can be done by tooling outside rustc - if I want normalized paths, I can normalize them. If I want them relative, absolute, etc, etc, I can do that outside, so long as I know that rustc isn't going to do anything complex/clever with them. The more complexity rustc applies, the fewer options I have.
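
A minimal sketch of this "pure string operation" model: the remapping is applied to the literal string from the command line, and the result is made absolute only if it is not absolute already. The function name and the `cwd` parameter are assumptions for illustration.

```rust
use std::path::{Path, PathBuf};

/// Remap the literal path string first (last mapping wins), then make
/// the result absolute only if the remapping did not already do so.
fn remap_then_absolutize(cwd: &Path, raw: &str, maps: &[(&str, &str)]) -> PathBuf {
    let remapped = maps
        .iter()
        .rev()
        .find_map(|(old, new)| raw.strip_prefix(*old).map(|rest| format!("{}{}", new, rest)))
        .unwrap_or_else(|| raw.to_string());
    let p = PathBuf::from(remapped);
    if p.is_absolute() {
        p
    } else {
        cwd.join(p)
    }
}
```

With the mapping `./foo/` to `/an/absolute/path/`, the input `./foo/main.rs` becomes `/an/absolute/path/main.rs` and the current directory never enters the result.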

jsgf on 18 Jan 2017

I agree that predictability is the most important thing here. I thought that the scheme I proposed above would provide that best, but with corner cases cropping up, I don't think that anymore.

Let's try to come up with a different rule set:

All paths provided by the user are preserved the way they are passed in (as @jsgf suggests).
Paths derived from a user-provided path also keep the same format, e.g., if I do rustc ./main.rs, I'll get ./mod1/sub.rs. If I do rustc main.rs, I get mod1/sub.rs.
If a path isn't already absolute, its absolute variant is constructed as getcwd() + user-provided-path. So ../src/main.rs will become /home/foo/project/build/../src/main.rs, for example.
At least in the beginning, the compiler will emit absolute paths everywhere (as it does now, see #34187. This might become optional once remapping is stable).
Remapping is the last thing that is done before storing a path in debuginfo, so you get DW_AT_comp_dir=remap(pwd), and DW_AT_decl_file=remap(abs_path).
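
The rule set above can be sketched on plain strings. This assumes Unix-style paths and string-level prefix matching searched last to first; the function names are hypothetical, and the returned pair stands for the (DW_AT_comp_dir, DW_AT_decl_file) values named in the last rule.

```rust
/// String-level remap; `maps` is in command-line order, searched last to first.
fn remap(path: &str, maps: &[(&str, &str)]) -> String {
    for (old, new) in maps.iter().rev() {
        if let Some(rest) = path.strip_prefix(*old) {
            return format!("{}{}", new, rest);
        }
    }
    path.to_string()
}

/// Absolute variant = getcwd() + user-provided path, with no normalization.
fn absolute_variant(cwd: &str, user_path: &str) -> String {
    if user_path.starts_with('/') {
        user_path.to_string()
    } else {
        format!("{}/{}", cwd.trim_end_matches('/'), user_path)
    }
}

/// Remapping is the last step. Returns (comp_dir, decl_file).
fn debuginfo_fields(cwd: &str, user_path: &str, maps: &[(&str, &str)]) -> (String, String) {
    (remap(cwd, maps), remap(&absolute_variant(cwd, user_path), maps))
}
```

One consequence of not normalizing: with cwd `/home/foo/project/build` and input `../src/main.rs`, a mapping from `/home/foo/project` matches, but a mapping from `/home/foo/project/src` does not, because the stored path still reads `/home/foo/project/build/../src/main.rs`.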

How does that sound?

(cc @luser, who might also be interested in this whole topic)

michaelwoerister on 18 Jan 2017

I think that the mappings should be applied last, after any other processing such as converting to absolute paths. I'd imagine that this would be more predictable, at least from an outsider that is merely observing what rustc does without reading its source code - in other words, one could build both with and without the mappings, and the output would be related in a way that is only based on the mappings and the algorithm, and not on anything else rustc might do now or in the future.

In (3), making all paths absolute, would mean the output can only be reproduced if rebuilders build it in the same path. Is this really necessary? I'd prefer to keep relative paths relative (as GCC does), even if they traverse above cwd. The parent tool that calls rustc could easily add extra ../src=xxx mappings if it feels that this is necessary. The one situation this wouldn't be appropriate, is if these relative paths are generated by rustc itself and are unknown to the parent tool - but does this situation actually occur?

infinity0 on 18 Jan 2017

I think that the mappings should be applied last, after any other processing such as converting to absolute paths.

That's what I suggest.

In (3), making all paths absolute, would mean the output can only be reproduced if rebuilders build it in the same path.

Or if they set up their prefix-mapping to result in the same path, right?

Is this really necessary?

It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.

michaelwoerister on 18 Jan 2017

The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked, so it won't know what absolute path applies at that point, and therefore can't do path remapping in those terms. It does know what paths it's putting on the commandline though, so it can generate remappings in those terms.

@michaelwoerister Good point about derived names; I'd overlooked those. (Mostly because I assume they'd have a common prefix from the perspective of mapping.)

jsgf on 18 Jan 2017

@jsgf

The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked

How do you construct your prefix mapping then?

michaelwoerister on 18 Jan 2017

How do you construct your prefix mapping then?

To make it more concrete: DW_AT_comp_dir always contains the current directory. If you want to map that to something stable, you have to know its value, right?

michaelwoerister on 18 Jan 2017

All relative. I'm not that interested in remapping comp_dir, since there is actually no one canonical path that makes sense in my environment. I'm more interested in remapping the source paths to a relative canonical name within the source tree.

To be specific:

In my environment there's a large build farm
It's using Buck for all building
Buck creates a symlink tree containing all the sources listed as dependencies for a given target, and nothing else so that compilation fails if any sources are referenced that aren't listed as dependencies
The build objects built in the farm are stored in a distributed cache, and reused if you're redoing the same build

The net result is that not only is the current directory some random name, but the path to the sources is prefixed by a buck-generated name. This means that debugging with a cached object will result in meaningless source paths. I want to use prefix remapping to map the path to the source from the symlink tree back to the canonical location in the source tree.

Reproducible builds are of secondary interest because they'd have better caching properties; in that case remapping comp_dir to some fixed (but essentially meaningless) string would help - but only if the rest of the object were bit-for-bit identical.

(Added bonus, I'd like to use the remapped name for error messages, but we can get to that later.)

jsgf on 18 Jan 2017

@jsgf Can you give a small example of what your mapping would look like?

michaelwoerister on 18 Jan 2017

It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.

If you're invoking rustc directly then this shouldn't be as much of a problem, since presumably you're not using cargo. If you're using cargo and still need reliable paths you could set CARGO_HOME to a known path and then remap that to a fixed path. We don't have this problem in Gecko (since I fixed the debug info for generics from external crates) because we've vendored all our crates into our source repo, so their source paths are always inside our top source directory.

luser on 18 Jan 2017

Something like:

rustc -Zdebug-prefix-map-from=./buck-out/gen/my/build-target#pic,rlib/ -Zdebug-prefix-map-to=./ --other --options ./buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs

ie, remapping the prefix to ./, with the expectation that this will result in DW_AT_comp_dir being <some random path>, but the DW_AT_name for each source file would be ./some/path.rs.

jsgf on 18 Jan 2017

BTW, -Z's behaviour of always emitting the "warning: the option Z is unstable and should only be used on the nightly compiler, but it is currently accepted for backwards compatibility; this will soon change, see issue #31847 for more details" warning makes it a non-starter for this.

jsgf on 18 Jan 2017

@jsgf Couldn't you use something like the following?

rustc -Zdebug-prefix-map-from=`pwd`/buck-out/gen/my/build-target#pic,rlib/ 
      -Zdebug-prefix-map-to=/something/
      `pwd`/buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs

michaelwoerister on 18 Jan 2017

@michaelwoerister There's no shell involved, so no backtick substitution.

jsgf on 18 Jan 2017

BTW, -Z's behaviour of always emitting the [...] warning ...

The plan is to move this to a stable -C flag once we are sure we want to stabilize it. In the beginning it will start out as unstable though (like all new features).

michaelwoerister on 18 Jan 2017

Still really annoying message. Is there -Z yes-I-know?

jsgf on 18 Jan 2017

Is there -Z yes-I-know?

Not that I know of, unfortunately.

michaelwoerister on 18 Jan 2017

There's no shell involved, so no backtick substitution.

The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative (because they are relative to something different for each crate). This is different from C/C++ where the source for templates is always available when they are instantiated.

michaelwoerister on 18 Jan 2017

The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative

Well, in this case they're either relative to the same path (ie, from the same sourcebase), or their path is meaningless (from crates.io, which are all prebuilt).

jsgf on 18 Jan 2017

The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked, so it won't know what absolute path applies at that point [..]

The net result is that not only is the current directory some random name, but the path to the sources is prefixed by a buck-generated name. [..]

If I understand correctly, this is an argument in favour of applying the mappings first, before any other processing? I think this is better fixed in Buck itself, because it is the one that sets up these symlinks. It should know what cwd is used for each invocation of Rust, so it should be able to construct the example maps suggested by @michaelwoerister that contain pwd even without a shell.

Also if one applies the mappings first, it likely would result in a non-existent path. Then it's unclear how you do "other processing" on this. (If there is no other processing, then "first" and "last" are the same, and we're in agreement.)

Tools that run after rustc has emitted the debuginfo have no control over what rustc does or how this might change over time. So it is more important to keep the expectations here simple. GCC also applies the remapping last (see gcc/dwarf2out.c in dwarf2out_early_finish).

I think that the mappings should be applied last, after any other processing such as converting to absolute paths.

That's what I suggest.

OK, I think I just got confused by the below point:

Is [making all paths absolute] really necessary?

It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.

Could you explain this in some more detail so I/others could think of how to "solve it differently"? For example this:

The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative

Well, in this case they're either relative to the same path (ie, from the same sourcebase), or their path is meaningless (from crates.io, which are all prebuilt).

this sort of makes sense to me, but I'm not sure what the details are.

infinity0 on 19 Jan 2017

@infinity0 Regarding relative paths versus extern crates, let's say you have the following setup:

/libfoo/src      <-- contains the source of libfoo
/libfoo/build    <-- this is where you build libfoo & cwd of rustc while doing so

If you have a generic function func in libfoo, its source location would be recorded as ../src/lib.rs.

Now, let's say you compile another library, libbar, that references libfoo and has a similar setup:

/libbar/src      <-- contains the source of libbar
/libbar/build    <-- this is where you build libbar & cwd of rustc while doing so

If you use foo::func in libbar you end up with a debuginfo entry that tells you that the source code of foo::func can be found in ../src/lib.rs, but we are relative to /libbar/build now, so the debugger would open /libbar/src/lib.rs and show you some unrelated source code. So, at least for items from external crates, in the general case we have to emit absolute paths.

michaelwoerister on 19 Jan 2017

@infinity0 In principle if all the building is happening on one machine, then Buck could control everything. But if the build is being distributed then Buck on machine A could set up an environment that's consistent with relative paths, but be in a different absolute directory on machine B. The problem is that paths produced from getcwd() may be absolute, but they are not canonical. (However, if Buck specified everything as absolute paths then they could be made canonical so that the same path works in all environments via the use of symlinks or similar - so I guess it could generate debug-prefix-path options to remap the abs build-time paths to relative.)

For the same reason, I think using absolute paths for the problem @michaelwoerister mentions above is also wrong. We want canonical paths for those source files, not absolute ones. The conflation between absolute and canonical is where I see problems.

Specifically, if you're trying to reconcile the paths for a libfoo build in the context of a libbar build, then I think it's more correct to use the libfoo's comp_dir and relative source paths to construct paths relative to libbar's comp_dir than to use absolute paths.

jsgf on 20 Jan 2017

Specifically, if you're trying to reconcile the paths for a libfoo build in the context of a libbar build, then I think it's more correct to use the libfoo's comp_dir and relative source paths to construct paths relative to libbar's comp_dir than to use absolute paths.

The weirdness here is for generics--they're not actually compiled until you instantiate them with a concrete type, so if you're using Foo<T> from libfoo it's not compiled until it gets used in libbar. What I implemented a while back was to put absolute paths in the metadata with the bytecode instead of the relative paths (which couldn't be resolved later). All it does is join the relative path with the comp_dir, so it should be functionally the same except for not having it in two separate fields.

luser on 20 Jan 2017

If you have a generic function func in libfoo, its source location would be recorded as ../src/lib.rs. [and the cwd as /libfoo/build].

The weirdness here is for generics--they're not actually compiled until you instantiate them with a concrete type, so if you're using Foo<T> from libfoo it's not compiled until it gets used in libbar.

These two comments sound inconsistent - in the first one, where is the source location recorded? But the second comment suggests this is not available?

Anyway, using the example with libfoo and libbar directly above again, there are two cases:

libfoo::x is compiled with cwd=/libfoo/build and name=../src/x. Then we compile libbar. In this case what @jsgf suggested seems sensible to me, i.e. store the name of x in libbar as relpath("/libfoo/build" + "../src/x", "/libbar/build") which would be ../../libfoo/src/x.
libfoo::x is not compiled at first. Then we compile libbar. In this case, libbar has to find x.rs somehow (or some other intermediate file, I don't know) in which case it can still use a relative path as its name?

In both cases we are assuming that /libfoo and /libbar's relative positions to each other are fixed across both builds (even if their absolute paths change). If this is not the case, then at the very least libbar has to "find" libfoo somehow. Whichever directory this "find" algorithm returns, (it could be /libfoo or /libfoo/build, I don't know), call this directory d, then you could additionally store relpath(d, cwd) somewhere in the rust metadata when building libfoo, so that the later libbar could still work with relative paths. It would be slightly more complex, but still easily achievable IMO. (And none of what I described involves canonicalising symlinks, which would mess with Buck.) I think this addresses the distributed scenarios that @jsgf described too, but I'm not familiar with the details so perhaps he could confirm that.

(edit: explain why I think the two comments sound inconsistent)
(edit: to clarify, by relpath(x, y) I mean the path from y to x, like python's os.path.relpath)
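
For readers following along in Rust, which has no relpath in its standard library, here is a minimal sketch of the purely lexical relpath(x, y) used above. It assumes both inputs are absolute Unix-style paths and never touches the filesystem, so symlinks are not canonicalised. The helper names normalize and relpath are made up for this illustration and are not rustc APIs.

```rust
use std::path::{Component, Path, PathBuf};

/// Collapse `.` and `..` lexically. Assumes `path` is absolute, so a `..`
/// can always be resolved by dropping the previous component.
fn normalize(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for comp in path.components() {
        match comp {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Path that leads from `base` to `target` (same argument order as
/// Python's os.path.relpath). Returns an empty path if both are equal.
fn relpath(target: &Path, base: &Path) -> PathBuf {
    let target = normalize(target);
    let base = normalize(base);
    let t: Vec<Component> = target.components().collect();
    let b: Vec<Component> = base.components().collect();

    // Number of leading components the two paths share.
    let common = t.iter().zip(b.iter()).take_while(|(x, y)| x == y).count();

    let mut out = PathBuf::new();
    // Climb out of whatever remains of `base`...
    for _ in common..b.len() {
        out.push("..");
    }
    // ...then descend into whatever remains of `target`.
    for comp in &t[common..] {
        out.push(comp.as_os_str());
    }
    out
}
```

With this, relpath(Path::new("/libfoo/build/../src/x"), Path::new("/libbar/build")) yields ../../libfoo/src/x, matching the first case above.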

infinity0 on 20 Jan 2017

libfoo::x is not compiled at first. Then we compile libbar. In this case, libbar has to find x.rs somehow (or some other intermediate file, I don't know) in which case it can still use a relative path as its name?

Sorry, I should have been clearer! The generic code gets converted into some sort of bytecode (I'm sure someone else knows the specifics here), and that's stored in the generated rlib along with some metadata (which I think is handled by librustc_metadata). If you list the contents of an rlib with ar t foo.rlib you'll see that it contains a .o file (the actually compiled bits of the crate), a rust.metadata.bin file, and a .bytecode.deflate file. The compiler gets the filename from the metadata for the items it's instantiating from bytecode.

Regardless, given that it has a full path in the metadata, I don't see why getting a relative path to the libfoo comp_dir and then joining the relative path would be better than just taking a relative path from the libbar comp_dir to the full path from libfoo. In your example, we'd have libfoo::x being compiled with relpath("/libfoo/src/x", "/libbar/build") which would still be ../../libfoo/src/x.

luser on 20 Jan 2017

In both cases we are assuming that /libfoo and /libbar's relative positions to each other are fixed across both builds.

This seems to be the main problem with this approach. I would be wary of positing that (1) the rule set must support situations where the compilation directory is not known, but (2) it can be assumed that relative positions never change. That seems too tailored to this specific situation to make for a good general rule. Also if there is no common prefix between two paths (as in D:\foo and X:\bar), there is no relative path between them and you have to know the original compilation directory of your upstream crate again to set up a mapping.

@jsgf How about if we introduce a variable __RUSTC_CWD to the mapping syntax? As in:

rustc -Zdebug-prefix-map-from=__RUSTC_CWD/./buck-out/gen/my/build-target#pic,rlib/ 
      -Zdebug-prefix-map-to=/something/
      ./buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs

A variable like this would be guaranteed to match the prefix of any absolute path the compiler emits.
Items from upstream crates would have their paths mapped with the mapping given when that upstream crate was originally compiled.

michaelwoerister on 20 Jan 2017

Sure, using the full path to lib.rs would also work. I didn't realise that was available; I was only following the earlier constraint "it's source location would be recorded as ../src/lib.rs". As long as we do get the correct final relative paths, how it's calculated is not so important to me.

However, my overall motivation is to avoid having absolute paths anywhere in the output, both in the debugging output and in the rlibs, since they can be installed onto end-user systems. That's where the second part of my suggestion comes in, which relates to the "find" algorithm used to locate other crates. Suppose we call d instead crate_dir for clarity, then here's a concrete example:

First we build libfoo with prefix map /path/to/libfoo=/usr/src/rust/libfoo
~~
libfoo/build/foo.rlib#libfoo.o:
comp_dir = /usr/src/rust/libfoo/build, mapped from /path/to/libfoo/build
name = ../src/lib.rs
libfoo/build/foo.rlib#rust.metadata.bin:
crate_dir = /usr/src/rust/libfoo, mapped from /path/to/libfoo
fn run_foo_generically:
path = ../src/lib.rs
~~
This is reproducible regardless of the rebuilder's build directory, as long as they set the right prefix-map.

Later on a different build machine, someone else builds libbar, using libfoo from /my/crates/libfoo, with no prefix map because they don't care about reproducibility, but they do care about correct debugging:
~~
libbar/build/bar.rlib#libbar.o:
comp_dir = /my/own/libbar/build
name = ../src/lib.rs
fn run_foo_generically:
path = /my/crates/libfoo/src/lib.rs
^ probably this is not exactly how things work, but hopefully "similar enough" that you get the idea
~~

And this last value /my/crates/libfoo/src/lib.rs would be calculated via:

joinpath(
    relpath(
        foo's comp_dir,
        foo's crate_dir),
    run_foo_generically's (rel)path (from rust.metadata.bin))

(This assumes that relpath(comp_dir, crate_dir) is reproducible and not something like ../../../lol/trolls/gonna/troll/libfoo/build.)
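
A rough Rust sketch of that calculation, for illustration only. It assumes the consumer has already located libfoo's sources at some directory (found_crate_dir below, e.g. /my/crates/libfoo), and that the rlib stores relpath(comp_dir, crate_dir) precomputed (here simply build). The function and parameter names are hypothetical, not anything rustc provides.

```rust
use std::path::{Component, Path, PathBuf};

/// Collapse `.` and `..` lexically; assumes an absolute input path.
fn normalize(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for comp in path.components() {
        match comp {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Rebuild the source path of an upstream item on the consumer's machine.
///
/// * `found_crate_dir`: where the consumer found the upstream crate
///   (assumed to be known; how it is found is out of scope here).
/// * `comp_dir_in_crate`: relpath(comp_dir, crate_dir) as stored in the rlib.
/// * `recorded`: the item's path relative to comp_dir, from the metadata.
fn resolve_upstream_source(
    found_crate_dir: &Path,
    comp_dir_in_crate: &Path,
    recorded: &Path,
) -> PathBuf {
    normalize(&found_crate_dir.join(comp_dir_in_crate).join(recorded))
}
```

For the example above, resolve_upstream_source(Path::new("/my/crates/libfoo"), Path::new("build"), Path::new("../src/lib.rs")) gives /my/crates/libfoo/src/lib.rs.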

infinity0 on 20 Jan 2017

Actually, even in the case that someone does want to build libfoo somewhere completely random, they could build it using absolute source path names (which might get remapped). Later, anyone else such as libbar can still recover these source paths relative to the crate_dir, which is all that is needed to correctly resolve the paths on libbar's side.

infinity0 on 20 Jan 2017

@michaelwoerister:

How about if we introduce a variable __RUSTC_CWD to the mapping syntax?

After going to the effort of adding the -from/-to variants to the command-line syntax to avoid needing to parse the string for a separator character, I don't think adding in metasyntactic variables is very consistent with that.

But more generally, I think there's a somewhat irreconcilable problem: sometimes absolute paths are the right thing to use, and sometimes relative is, on a purely case-by-case basis.

(My background summary for my own benefit)

For code that is compiled directly to a pure object file, the answer is pretty clear: each object has a corresponding set of files with a DW_AT_comp_dir and source paths relative to that. Tools can try to construct their own abspath by combining them, but also construct relative paths purely from the source names, or by having their own source_path to apply to each filename.

The problem arises from code that isn't completely generated into object code at its own build time, but defers it to later specialization in the context of some other module.

C/C++ has no problem with this, because this is always performed in source terms; the second compilation always needs to refer to the source of the first compilation, so we know they're at least being compiled in the same namespace, and the DW_AT_comp_dir+relative path will resolve to something meaningful.

In Rust it's trickier because there's always a compilation to a form of object file (either .rlib or .so), so the original sources are never needed, and the second compilation could be in a completely different filesystem namespace, making the concept of "path to source" for the first compilation potentially meaningless. Or they could be in the same namespace, but with no meaningful relative relationship. Or they could be in the same source tree, but the absolute position of that tree might be different from build to build (or between builder and editor).

So, given that, the questions that occur to me are:

Does Dwarf have a way of expressing what we want here?

That is, "code from comp_dir=.../libfoo src path ../src/thing.rs inlined and specialized into comp_dir=.../libbar src path some/other/path.rs"

How does C/C++ handle this with

precompiled headers?
C++ modules?
Link-time whole-program optimization?

jsgf on 20 Jan 2017

Hm, on closer inspection, it looks to me like DW_AT_comp_dir and DW_AT_name for the source are a red herring; rustc only seems to generate an entry for the top-level lib.rs, and the rest of the sources don't appear there.

The real action is happening in the "Directory Table" (in readelf output, include_directories in the DWARF spec) and the "File Name Table" (file_names). It seems that rustc always generates bare names in the file name table (lib.rs), and generates a new directory entry for each path within the crate, as full paths (/my/source/is/here/libfoo/src/submodule).

The DWARF spec says the file names are relative to either DW_AT_comp_dir or a specific entry in the directory table.

So, for example, libfutures.rlib has:

  The Directory Table (offset 0x1b):
  1     /my/full/path/to/futures/src
  2     /my/full/path/to/futures/src/future
  3     /my/full/path/to/futures/src/stream
  4     /my/full/path/to/futures/src/sink
  5     /my/full/path/to/futures/src/task_impl
  6     /my/full/path/to/futures/src/sync
  7     /my/full/path/to/futures/src/sync/mpsc
[...]

 The File Name Table (offset 0x5f1):
  Entry Dir     Time    Size    Name
  1     1       0       0       lib.rs
  2     2       0       0       mod.rs
  3     2       0       0       lazy.rs
  4     0       0       0       <std macros>
  5     2       0       0       flatten_stream.rs
  6     2       0       0       join.rs
  7     2       0       0       select.rs
  8     2       0       0       chain.rs
  9     2       0       0       join_all.rs
  10    2       0       0       select_all.rs
  11    2       0       0       select_ok.rs
[...]

This means that lib.rs is relative to /my/full/path/to/futures/src, etc. I think rustc is wrong here. It should only have one path to this crate's sources, and then each name in the filenames list should be relative to that. For example:

Assuming DW_AT_comp_dir is /my/full/path/to/futures/src:

  The Directory Table (offset 0x1b):
  1     /my/full/path/to/futures/src
[...]

 The File Name Table (offset 0x5f1):
  Entry Dir     Time    Size    Name
  1     1       0       0       lib.rs
  2     1       0       0       future/mod.rs
  3     1       0       0       future/lazy.rs
  4     0       0       0       <std macros>
  5     1       0       0       future/flatten_stream.rs
  6     1       0       0       future/join.rs
  7     1       0       0       future/select.rs
  8     1       0       0       future/chain.rs
  9     1       0       0       future/join_all.rs
  10    1       0       0       future/select_all.rs
  11    1       0       0       future/select_ok.rs
[...]

so that all the names are also sensible relative to DW_AT_comp_dir as well as having an absolute path. (This is post remapping if the actual build happened in a separate build dir.)

What's more, code from other modules can use different directory entries:

 The Directory Table (offset 0x1b):
[...]
  8     /my/path/to/rust/1.14/src/rust/src
[...]

 The File Name Table (offset 0x5f1):
  Entry Dir     Time    Size    Name
[...]
  41    8       0       0       liballoc/arc.rs
[...]

So I think this goes back to @michaelwoerister's comment above about how rustc should construct the pathnames to emit based on the top-level source passed to rustc, and make sure that it propagates that all the way through to the DWARF info without rewriting them as absolute dir + filename.

jsgf on 20 Jan 2017

@jsgf The other files appear in DW_AT_decl_file attributes. This is how DWARF encodes paths. Paths in the file name table are relative to their directory table entry, paths in the directory table are relative to the DW_AT_comp_dir of the compilation unit. It's done in a way that maps well to C include directories but there is no clear rule how this should be used (which doesn't mean that the way rustc does it now is a good one).
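
To make the lookup rule concrete, here is a small Rust sketch of how a consumer could turn a file name table entry into a path, following the DWARF 2 to 4 line-table convention (directory index 0 means DW_AT_comp_dir, index n means the n-th include_directories entry). FileEntry and resolve_file are illustrative names, not part of any real DWARF library.

```rust
use std::path::{Path, PathBuf};

/// One row of the "File Name Table" (time and size columns omitted).
struct FileEntry<'a> {
    name: &'a str,
    dir_index: usize,
}

/// Resolve a file entry to a path. Returns None if the directory index
/// points past the end of the directory table.
fn resolve_file(
    comp_dir: &Path,
    include_dirs: &[&str],
    file: &FileEntry<'_>,
) -> Option<PathBuf> {
    let dir = match file.dir_index {
        // Index 0: the file lives in the compilation directory itself.
        0 => comp_dir.to_path_buf(),
        // Index n: the n-th directory entry. `join` keeps an absolute
        // entry as-is and resolves a relative one against comp_dir.
        n => {
            let entry = *include_dirs.get(n - 1)?;
            comp_dir.join(entry)
        }
    };
    Some(dir.join(file.name))
}
```

Both layouts discussed above resolve to the same file: lazy.rs under directory entry /my/full/path/to/futures/src/future, and future/lazy.rs under directory entry /my/full/path/to/futures/src.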

michaelwoerister on 20 Jan 2017

@michaelwoerister DW_AT_decl_file gives the file index for a piece of the code, but the ultimate filename comes from the file name and directory tables.

jsgf on 20 Jan 2017

@infinity0 In your example from above, when compiling libbar, I don't think the compiler would be able to know that the source of libfoo could be found under /my/crates/libfoo. It only knows where libfoo.rlib is, which is independent of libfoo's source location.

michaelwoerister on 20 Jan 2017

DW_AT_decl_file gives the file index for a piece of the code, but the ultimate filename comes from the file name and directory tables.

Yes, sorry, that didn't make too much sense. The important thing is that the file-name and directory tables are just an arbitrary encoding. It seems that GCC for example does the same as rustc (maybe because it takes the least space).

michaelwoerister on 20 Jan 2017

@michaelwoerister I see, OK. This situation is indeed not handled well by DWARF. But I think it's possible to make it work, going back to the crate_dir concept. Instead of recording paths to libfoo relative to libbar, you could record these paths as a special dummy path like crate://libfoo/$relative_path, where $relative_path is as I suggested above, calculated using crate_dir and other information in libfoo.rlib.

I think this sort of solution is inevitable given the differences from C/C++ that @jsgf just talked about in detail. If I build libbar against libfoo.rlib without necessarily having the latter's source code, but I want my users to be able to obtain that source code to debug my library, then the only theoretically possible thing to do is to refer to a relative path, one that is relative to some "abstract idea" of where the source code of libfoo might be.

Anyway, this is going slightly away from the topic of prefix maps. To summarise, I think we should separate the concerns here:

to support debugging source code from other libraries libfoo (whose source code might not be available at build time, but debugging symbols are), this can be achieved by recording crate_dir and using virtual paths (e.g. crate://libfoo/path/to/file/containing/a/generic/symbol)
to support reproducible builds, this can be achieved by using prefix-maps to transform paths as usual, e.g. crate_dir, comp_dir, name, those things in the Directory Table, etc.

Actually, if it's possible/suitable for rustc to auto-detect the top-level crate_dir of any given file that it's compiling, these prefix-maps aren't even necessary at all - all paths can simply be made relative to this directory, and debuggers can recreate these paths on their side no problem. Prefix-maps are only really needed when the build tool does not know the "top-level" directory of the source code and needs a parent buildsystem to pass it in, which is the case for traditional C compilers but not usually the case for more modern languages.
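
As a rough illustration of the virtual-path idea, here is a Rust sketch. The crate:// scheme is only the dummy syntax suggested above, and the per-crate root table stands in for whatever configuration a debugger or other consumer would offer; to_virtual and resolve_virtual are hypothetical names.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Build-time side: record a path relative to the crate's top-level
/// directory as a virtual path instead of an absolute one.
fn to_virtual(crate_name: &str, rel_path: &str) -> String {
    format!("crate://{}/{}", crate_name, rel_path)
}

/// Debug-time side: map a virtual path back to a real file, given the
/// user's own table of where each crate's sources live. Returns None for
/// paths that are not virtual or for crates missing from the table.
fn resolve_virtual(
    virtual_path: &str,
    roots: &HashMap<String, PathBuf>,
) -> Option<PathBuf> {
    let rest = virtual_path.strip_prefix("crate://")?;
    let (crate_name, rel) = rest.split_once('/')?;
    Some(roots.get(crate_name)?.join(rel))
}
```

The build output then contains only crate://libfoo/src/lib.rs, and each user decides locally which directory libfoo stands for.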

infinity0 on 21 Jan 2017

Perhaps I missed this, but is there a reason that we can't remap the paths when they are being stored into an rlib, to avoid the need to map them later on? ie: do the remap at the point where the source location is known.

By doing this, non-cargo builders have the control they need, and we'd just need some higher level support for path-remapping in cargo to handle its multiple rustc invocations.

jmesmon on 22 Jan 2017

@infinity0 Your idea of "abstract source" matches up with what I had in mind as the primary motivation for prefix mapping and can also be implemented via regular prefix mapping if we do what @jmesmon suggests:

When you invoke the compiler with something like rustc ../src/lib.rs, you can have a mapping ../src/ -> [email protected]/ (if you are using relative paths) or /absolute/path/to/your/src/ -> [email protected]/ where [email protected] is a kind of abstract name for the given source code. In a later step the consumer (like GDB) can have its own mapping for that, or you can remap it again with debugedit. For this to work, we need to do what @jmesmon says: remap paths as they are stored or, equivalently, store the mapping with the crate and apply it to anything that comes from that crate. This way, things coming from other crates are potentially already in the known "abstract source space" and you know what to do with them.

michaelwoerister on 26 Jan 2017

UPDATE

Below is a description of the new mapping algorithm I plan to implement. I think it is a synthesis of the insights that emerged from the discussion above and I also think that it meets the different requirements. Please let me know if you have any feedback!

Requirements

1. Debuginfo for inlined functions still just works out-of-the-box.
2. Paths can be mapped into a "common source space".
3. Paths can be mapped irrespective of the compiler's working directory.

Key differences to previous proposals

The mapping is applied separately to compiler working directory and file paths in order to achieve goal (3). This is what GCC et al do. A previous proposal of mine would first concatenate the two and apply the mapping afterwards.
This approach always applies the mapping that was specified for the crate that contains the mapped path. This is what makes it feasible to skip making paths absolute before mapping them and still support inlined functions.
Paths for inlined items are conditionally made absolute in order to achieve goal (1). If a path has been modified by a mapping, however, it will not be made absolute and it's up to the user to get things right. This behavior could be surprising, so it will need to be well documented.

Pseudocode

fn map_path_for(path, def_id)
{
    // Always use the mapping that was defined for the given crate. A DefId
    // knows which crate it's coming from.
    let mapping = get_mapping_for(def_id);

    let (mapped_path, was_affected_by_mapping) = mapping.map(path);

    if def_id.is_from_local_crate()
    {
        // For the local crate we are done because all inputs are relative to
        // the working directory of the current compiler invocation
        return mapped_path
    }

    if path.is_absolute() 
    {
        // For absolute paths there's nothing more to do since those
        // don't depend on the working directory of any compiler invocation
        return mapped_path
    }

    // At this point we know that we have the path for something that has
    // been inlined into the current crate and it has a path that is relative
    // to the compiler's working directory when the upstream crate was
    // compiled.
    // We need to find out if the path that the debugger is going to look up
    // was in any way affected by path-remapping: 
    // - If yes, we assume that this was intentional and don't mess with it.
    // - If no, we need to make the path absolute so that debuggers can still 
    //   find it.

    if was_affected_by_mapping || cgu_working_dir_is_affected_by_mapping(def_id)
    {
        // Assume the user knows what they are doing
        return mapped_path
    }
    else 
    {
        // Make absolute so that the debugger can find the source file
        return get_cgu_working_dir_for(def_id).join(mapped_path)
    }
}
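
For anyone who wants to experiment with the algorithm, here is a self-contained Rust sketch of the same logic. It is a simplification under stated assumptions: the DefId lookups (get_mapping_for, get_cgu_working_dir_for, is_from_local_crate) are replaced by an explicit CrateInfo value, and Mapping is a plain ordered list of prefix replacements. Neither type exists in rustc under these names.

```rust
use std::path::{Path, PathBuf};

/// Ordered prefix replacements for one crate (hypothetical type).
struct Mapping {
    rules: Vec<(PathBuf, PathBuf)>,
}

impl Mapping {
    /// Apply the first rule whose `from` is a prefix of `path`.
    /// Returns the resulting path and whether any rule matched.
    fn map(&self, path: &Path) -> (PathBuf, bool) {
        for (from, to) in &self.rules {
            if let Ok(rest) = path.strip_prefix(from) {
                return (to.join(rest), true);
            }
        }
        (path.to_path_buf(), false)
    }
}

/// What the sketch needs to know about the crate that owns a path.
struct CrateInfo {
    is_local: bool,
    working_dir: PathBuf,
    mapping: Mapping,
}

fn map_path_for(path: &Path, krate: &CrateInfo) -> PathBuf {
    // Always use the mapping of the crate the path comes from.
    let (mapped, was_affected) = krate.mapping.map(path);

    // Local paths are relative to the current working directory, and
    // absolute paths do not depend on any working directory.
    if krate.is_local || path.is_absolute() {
        return mapped;
    }

    // Relative path from an upstream crate: leave it alone if the user
    // remapped either the path or that crate's working directory.
    let (_, working_dir_affected) = krate.mapping.map(&krate.working_dir);
    if was_affected || working_dir_affected {
        mapped
    } else {
        // Otherwise make it absolute so a debugger can still find it.
        krate.working_dir.join(mapped)
    }
}
```

Note that the unmapped fallback joins paths without normalising them, so an upstream ../src/bar.rs compiled in /crates/A/build comes out as /crates/A/build/../src/bar.rs.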

Examples

Crate Graph

    A   B
     \ /
      C        ... C depends on A and B

Build Setup

CRATE A
working dir = /crates/A/build
src = /crates/A/src/lib.rs

CRATE B
working dir = /crates/B/build
src = /crates/B/src/lib.rs

CRATE C
working dir = /crates/C/build
src = /crates/C/src/lib.rs

Example with relative paths given to compiler

What the compiler sees:

CRATE A
working dir = /crates/A/build
src = ../src/lib.rs

CRATE B
working dir = /crates/B/build
src = ../src/lib.rs

CRATE C
working dir = /crates/C/build
src = ../src/lib.rs

Example 1, build directory independent configuration:

Mapping for A: ../src => /mapped-src-of-A
Mapping for B: ../src => /mapped-src-of-B
Mapping for C: ../src => /mapped-src-of-C

RESULTS:
working directory = /crates/C/build
function in C, defined in /crates/C/src/foo.rs => /mapped-src-of-C/foo.rs
function in C, imported from A, defined in /crates/A/src/bar.rs => /mapped-src-of-A/bar.rs
function in C, imported from B, defined in /crates/B/src/baz.rs => /mapped-src-of-B/baz.rs

Example 2, map all crates into "common source space":

Mapping for A: /crates/A/build => common-src-root
               ../src          => A/src
Mapping for B: /crates/B/build => common-src-root
               ../src          => B/src
Mapping for C: /crates/C/build => common-src-root
               ../src          => C/src

RESULTS:
working directory = common-src-root
function in C, defined in /crates/C/src/foo.rs => C/src/foo.rs
function in C, imported from A, defined in /crates/A/src/bar.rs => A/src/bar.rs
function in C, imported from B, defined in /crates/B/src/baz.rs => B/src/baz.rs

Example with absolute paths given to compiler

What the compiler sees:

CRATE A
working dir = /crates/A/build
src = /crates/A/src/lib.rs

CRATE B
working dir = /crates/B/build
src = /crates/B/src/lib.rs

CRATE C
working dir = /crates/C/build
src = /crates/C/src/lib.rs

Example 1, build directory independent configuration:

Mapping for A: /crates/A/src => /mapped-src-of-A
Mapping for B: /crates/B/src => /mapped-src-of-B
Mapping for C: /crates/C/src => /mapped-src-of-C

RESULTS:
working directory = /crates/C/build
function in C, defined in /crates/C/src/foo.rs => /mapped-src-of-C/foo.rs
function in C, imported from A, defined in /crates/A/src/bar.rs => /mapped-src-of-A/bar.rs
function in C, imported from B, defined in /crates/B/src/baz.rs => /mapped-src-of-B/baz.rs

Example 2, map all crates into "common source space":

Mapping for A: /crates/A/src => common-src-root/A/src
Mapping for B: /crates/B/src => common-src-root/B/src
Mapping for C: /crates/C/src => common-src-root/C/src

RESULTS:
working directory = /crates/C/build
function in C, defined in /crates/C/src/foo.rs => common-src-root/C/src/foo.rs
function in C, imported from A, defined in /crates/A/src/bar.rs => common-src-root/A/src/bar.rs
function in C, imported from B, defined in /crates/B/src/baz.rs => common-src-root/B/src/baz.rs

Again, please let me know if you have any feedback!

michaelwoerister on 18 Apr 2017

It looks like pseudo code should have return path replaced with return mapped_path? Otherwise no mapping is going to take place :)

On making the paths for inlined functions absolute: is that something that gcc/clang/other compilers do? It seems like relative paths could still be correct there (i.e. relpaths can work in the case where the whole source+binary are moved together, while abspaths can work when the source is not moved but the binary is). This seems minor though.

Otherwise, this sounds like it's at the point where we just need to try it out 👍

jmesmon on 19 Apr 2017

It looks like pseudo code should have return path replaced with return mapped_path? Otherwise no mapping is going to take place :)

You might be on to something there :) I fixed the code snippet, thanks!

On making the paths for inlined functions absolute: is that something that gcc/clang/other compilers do?

No, this is something that is specific to Rust. In C/C++ each compiler invocation has the correct relative paths available because of header files. In Rust we don't have header files and don't have a connection between an rlib and the source it was compiled from.

michaelwoerister on 19 Apr 2017

I'd imagine that LTO in gcc/clang could mirror (the path mapping issues with) Rust's function inlining, since in the LTO case the functions to be inlined (if the compiler chooses to inline them) would come from another translation unit.

jmesmon on 19 Apr 2017

I'm still digesting the above, but it looks good so far.

Wishlist request: an option to make paths/filenames in compiler messages the remapped version, rather than the input version.

jsgf on 21 Apr 2017

👍1

Triage: there's been a ton of discussion on this issue, but I have no idea what the current state of it is. can anyone summarize?

steveklabnik on 24 Sep 2018

#41555 was the tracking issue and it's closed now.

We are using --remap-path-prefix in Debian and it works to make ripgrep reproducible.

rustc itself is still not reproducible however, and I don't know why. That is tracked in #34902. There have been quite a few regressions.

infinity0 on 25 Sep 2018

Yes, it's available on stable so I'm closing this issue.

michaelwoerister on 25 Sep 2018

👍1

Rust: Allow remapping source path prefixes in debug output

All 73 comments
