Rust: syntax_pos::symbol::Symbol::gensym() is incompatible with stable hashing.

Created on 23 Mar 2018  路  19Comments  路  Source: rust-lang/rust

The current scheme for "gensym"ing names via Symbol::gensym() is rather clever and elegant but unfortunately it defies being hashed in a stable way as is needed for incremental compilation. So far, incr. comp. has erroneously treated differing Symbols with equal string contents as being the same. Now situations are starting to arise where this leads to problems (https://github.com/rust-lang/rust/issues/48923, https://github.com/rust-lang/rust/issues/49065).

I consider this a somewhat urgent issue.

cc @rust-lang/compiler, esp. @petrochenkov & @jseyfried

A-incr-comp A-macros A-resolve A-syntaxext C-bug P-medium T-compiler

Most helpful comment

We should investigate implementing gensyms via SyntaxContext (which means via Span after #49154).
In theory, every situation when symbol != gensym(symbol) makes sense should include hygiene into comparison as well.
If "gensymness" is kept in SyntaxContext side tables somehow, then symbol1 == symbol2 will always be equivalent to string_content(symbol1) == string_content(symbol2), that also looks like a simplification of the mental model to me.

All 19 comments

I generally consider gensym problematic and would prefer it if all hygiene was done through Span.
There are maybe a few non-hygiene uses but they should probably be done in other ways.

We should investigate implementing gensyms via SyntaxContext (which means via Span after #49154).
In theory, every situation when symbol != gensym(symbol) makes sense should include hygiene into comparison as well.
If "gensymness" is kept in SyntaxContext side tables somehow, then symbol1 == symbol2 will always be equivalent to string_content(symbol1) == string_content(symbol2), that also looks like a simplification of the mental model to me.

Blocked on removing emulation of hygiene with gensyms (can't emulate gensyms with hygiene if we are doing exactly the opposite thing right now!) (https://github.com/petrochenkov/rust/tree/spident2, in progress), which in its turn is waiting for https://github.com/rust-lang/rust/pull/49154 and https://github.com/rust-lang/rust/pull/49718.

Could gensym generate a name with a stable prefix (proc macro crate name + version), a stable ID (proc macro invocation local) and some character that is illegal in names, so noone can name it? Maybe even throw in the stable hash of the code it is invoked upon?

Categorizing as P-high so that this keeps annoying us, even though we don't have anyone assigned to it or actively working on it.

We've been coming back to this issue in the @rust-lang/compiler meeting without progress for a few weeks in a row now. It would be great if we could come up with some action items. @petrochenkov , the two blocking PRs you mentioned above have be merged in the meantime. What would be the next steps? Or broader question: Do we know what to do and just need to implement it or are there still open questions about how this can be solved?

@michaelwoerister

Or broader question: Do we know what to do and just need to implement it or are there still open questions about how this can be solved?

With regards to the first part (removing emulation of hygiene with gensyms) we know what to do and just need to implement it.
I've opened https://github.com/rust-lang/rust/pull/51072 that does this for fields, but there are also associated items and lifetimes.
This is mostly a mechanical work, but it's also a great opportunity to do audit of identifier use and fix hygiene bugs (this part is not so mechanical).

The good news is that I have a few free days before I leave on vacation on May 31, and I was going to spend them on Macro 1.2 stuff, but there's a good chance I'll have time left on Identification of HIR.
If something remains unfinished after May 31 someone else can do it without identifier audit.

With regards to the second part, I didn't yet think about how exactly gensyms should be encoded in HygieneData structures so they are compatible with stable hashing. I'm not sure what are requirements - stable representation for different compilation sessions for the same crate, different compilation sessions for different crates?

That sounds like good news, thanks @petrochenkov!

Regarding stability: The main requirement is that a gensymed symbol hashes the same way in two different compilation sessions regardless of what other HIR items have been added before or after it. If something in the item that contains the symbol has changed, then the hash can also change.

An example what we need for stable hashing are HirIds. They consist of a DefIndex (for which we hash the DefPath which does not change, even if we insert other items) and an ItemLocalId, which just numbers all nodes within the given item.

visiting for triage.

assigning to @michaelwoerister to ensure that it gets addressed (potentially via further delegation).

Some status update: https://github.com/rust-lang/rust/pull/51952/commits/94ef9f57f5fa985beb7588e5cb4c73f1b5f2dcba was merged, so all macros (including legacy builtin ones) can now easily generate identifiers with def-site hygiene.
So most of gensyms (although perhaps not all of them) should now be replaceable with hygienic identifiers.

Thanks for the awesome work so far, @petrochenkov!

The current situation is that we don't use gensym anymore but we are still not hashing things properly: https://github.com/rust-lang/rust/pull/51492#issuecomment-399971873

I haven't had time to think about the various solution strategies yet.

visiting for triage. No progress yet. Should consider either reassigning or lowering priority if we are still stalled next week.

The problem remains, but at least we have some kind of assertion that should catch problems related to it. Going to downgrade to P-medium for now.

... but at least we have some kind of assertion that should catch problems related to it

In particular, this one:
https://github.com/rust-lang/rust/blob/fbb6275f4fd6cf774e1789fabfacae7248c45021/src/librustc/ty/query/plumbing.rs#L526-L535

I haven't seen this written out explicitly, but expansion/hygiene data can be stored in metadata/incremental-data in a stable way using the same DefPath approach as other things.

The basic thing to encode here is ExpnId and it has a well-defined parent DefIndex (currently available through the Definitions::invocation_parents table).
If multiple expansions share the parent def-path, then, I think, they should be able to just get local disambuguators.

mod m {
    fn foo() { // def-path: m::foo
        // Here's how the closures' `NodeId`s get a stable representation.
        let closure1 = || {}; // m::foo(1)
        let closure2 = || {}; // m::foo(2)

        // Same idea, but for macro invocations and their `ExpnId`s instead of `NodeId`s
        mac1!(); // m::foo!(1)
        mac2!(); // m::foo!(2)
    }
}

If we can encode ExpnId, then we can encode SyntaxContext as well (a chain of ExpnIds), then Span and everything else.
ExpnId is the only ID-like thing in this area, other components of hygiene structures are plain data.


https://github.com/rust-lang/rust/pull/63919 should be one of few last steps in the removal of gensyms, after that we'll only have hygiene, which should be encodable as described above, if I'm not missing anything.

Hm, making macro invocations a regular DefPath component... I haven't thought about this deeply yet but it might be genius :) Then DefPath could already be assigned during parsing, with macros just adding more of them without conflicts. cc @eddyb @nikomatsakis

@michaelwoerister Yeah, that's also roughly my vision for moving incremental further back, I think I discussed it with @nikomatsakis before.

I was supposed to write about this months ago but other things came up.

In short, there'd be "Def" layer that parsing, expansion and cross-crate metadata would all interact with, with AST/HIR remaining mainly for the intra-definition syntactic categories and miscellaneous syntax.

A first step could be splitting HIR into "HIR Def" and "everything else", but moving def paths by themselves further back would also make sense.

Def paths are currently created during expansion, when a freshly expanded AST fragment is integrated into the partially built AST.
That's already early enough for representing expansions with def paths.

馃帀 馃帀 馃帀 馃帀 馃帀

Was this page helpful?
0 / 5 - 0 ratings