Roslyn: Comments and whitespace should not influence determinism

Created on 7 Dec 2015 · 35Comments · Source: dotnet/roslyn

Comments and white space should not change the deterministic build output of an assembly. Right now it is doing so indirectly via the PDB hash.

In order to calculate a deterministic PDB guid we hash all of the information sent to the PDB. That hash is then broken up into the guid value used as the ID. This id is embedded into the DLL / EXE and thus code input which changes the PDB ID, also impacts determinism.

Presently we are including the checksum of every file in the compilation in the PDB hash

https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/NativePdbWriter/PdbWriter.cs#L980

This means any white space or comment changes will indirectly impact determinism. This block should be removed.

Area-Compilers Bug Concept-Determinism Resolution-By Design

Source

jaredpar

All 35 comments

Comments and white space should not change the deterministic build output of an assembly

Are you sure about that? What about things like CallerLineNumberAttribute. Won't that cause your line number to be embedded into the code. And the line number could certainly be affected by comments/whitespace.

CyrusNajmabadi on 7 Dec 2015

@CyrusNajmabadi such a change though will result in different IL because the constant used as the default value will change. This issue is calling out that white space changes, by themselves, should not be causing differences in the deterministic output.

jaredpar on 7 Dec 2015

Why? The PE file contains the PDB ID. The PDB has to include hash of every source file. Also sequence points likely change if you change lines and columns of statements.

tmat on 7 Dec 2015

@tmat changing comments and white space do not effect the behavior of my DLL. Hence why should I be re-running a full test suite if I fix a typo? It's a waste of processor and developer time. The deterministic build should only change ideally when the DLL has functionally changed.

jaredpar on 7 Dec 2015

@jaredpar But it affects debugging behavior. The assembly contains debugging information. In your specific scenario whitespace doesn't matter, but in other scenarios it does.

tmat on 7 Dec 2015

@tmat the PDB is still produced hence debugging will still function.

jaredpar on 7 Dec 2015

@jaredpar But the PDB is pointed to by the PE file. And the pointer includes the hash. Are you proposing to remove it?

tmat on 7 Dec 2015

@jaredpar or to not include the PDB ID in the PE hash?

tmat on 7 Dec 2015

@tmat the hash will still be included, I just don't want artifacts of the PDB which don't effect execution to be included in the GUID

jaredpar on 7 Dec 2015

@jaredpar Sorry that can't be done. The debugger requires it.

tmat on 7 Dec 2015

@tmat disagree. I don't see the debugger taking a dependency, or really an interest, in how we generate the content ID of a PDB.

jaredpar on 7 Dec 2015

@jaredpar We can potentially exclude the PDB ID from PE hashing. But then the GUID no longer represents the whole content of the file.

tmat on 7 Dec 2015

@jaredpar Well, either the PDB ID won't represent the content of the PDB file or the PE GUID won't represent the content of the PE file.

tmat on 7 Dec 2015

Also in case of Portable PDB we hash the raw content of the PDB file. Again the PDB contains a table of documents each has a hash of the corresponding source file.

tmat on 7 Dec 2015

@tmat what is your alternative suggestion? That determinism changes if any non-functional artifact of the PE changes? That seems like a huge waste of time and resources.

Particularly evidenced by the 4 hours I wasted this weekend tracking down the time stamp we included in generated files. This in no conceivable way effected the functionality of my program. Yet determintism kept producing different results. Had I not been a member of the compiler team what would my options have been (other than to think this was a busted feature)?

jaredpar on 7 Dec 2015

You'd report a bug on the generator, since the generator was non-deterministic. If the compiler gets non-deterministic inputs it will produce non-deterministic outputs. The generator could have also generated some non-deterministic IL/metadata that doesn't affect execution (unused constants e.g.). So are we excluding those too?

tmat on 7 Dec 2015

@tmat how was I to find that the generator had a bug? All I see is the same input going in, determinism producing different values with literally 0 difference in the IL? As a customer I'm just going to file a bug.

jaredpar on 7 Dec 2015

@jaredpar That's a good question. In general customers will have complex builds and making sure that every step uses deterministic tool is a challenge. Perhaps we should provide some tooling to help diagnosing such issues.

tmat on 7 Dec 2015

No.

tmat on 10 Dec 2015

I guess if the build system takes advantage of reference assemblies (and reference assemblies have no pdbs), then editing whitespace/comments shouldn't affect the bytes of the reference assembly PE. That is the scenario that has to work. I think it's important to invest in pervasive and correct support for reference assemblies in the build systems (MSBuild/Domino), then changing the actual assembly/pdb won't matter as much as the ref assembly will stay the same.

KirillOsenkov on 11 Dec 2015

👍1

+1 for what @KirillOsenkov says

tmat on 11 Dec 2015

So the solution proposed here is:

Tools which have minor white space differences should silently break determinism.
Builds should produce identical reference assemblies but different DLLs.

The end effect is customers will enable determinism, build their repo several times, see the result changes but DLLs are identical. How will they interpret this as anything but a bug in determinism?

What tools are we giving them to track down this problem?

I don't see how this is a good path forward for customers.

jaredpar on 11 Dec 2015

I want to reiterate, I filed this issue because I spent over 4 hours tracking down such a problem. I work on the compiler team and have experience with this feature. I was only able to track this down by instrumenting the compiler with determinism debugging code to figure out where the difference was occurring.

Customers will not be doing this. They will look at the results, conclude this is a broken feature and move onto something else.

jaredpar on 11 Dec 2015

👍1

The path forward would be to validate that tools are deterministic if they claim so in msbuild under some diagnostic switch. So the customer can run msbuild /checkDeterminism Foo.sln and get errors for tools that are not deterministic.

tmat on 11 Dec 2015

How to validate? Again you're asking the customers to debug this themselves. That's not a solution.

jaredpar on 11 Dec 2015

So what do you propose we do about third party tools that customers have in their build systems? If these tools generate random inputs to the compiler the build won't be deterministic no matter what we do in the compiler.

tmat on 11 Dec 2015

@tmat In cases where it causes the DLL / IL to meaningfully differ then customers will be able to use the tools they have available today to track down hte problem. No one expects determinism to produce the same DLLs when the names of members are changing. But when they see no changes to the IL yet different DLLs they will correctly push back that we have a bug.

jaredpar on 11 Dec 2015

Well, then it's about perception and setting expectations. We should inform the customers that _any_ change in source code will affect the result. Tools to diff source code exist.

The documentation /deterministic should simply say "the same input => the same output". Input includes all source files, so I think that's very simple to explain.

tmat on 11 Dec 2015

@tmat this is where I disagree. Typing a space in my project has literally 0 effect on the functionality of the DLL. Making it a requirement to change the SHA when that occurs makes no sense to me. It seems to cut against the intent of this feature.

jaredpar on 11 Dec 2015

I believe the intent of the feature has always been "the same input => the same output" no less and no more. It wasn't "functionally same input => the same output".

tmat on 11 Dec 2015

I would not interpret this as a determinism bug. I would say "well, I changed input, so I get new output".

Determinism has always been, for me, a question of "if I _did_ not change _anything_, then _nothing_ happens". i.e. I need the exact same inputs in order to be able to reuse the cache.

CyrusNajmabadi on 11 Dec 2015

@tmat this is where I disagree. Typing a space in my project has literally 0 effect on the functionality of the DLL. Making it a requirement to change the SHA when that occurs makes no sense to me. It seems to cut against the intent of this feature.

Jared, this sounds like a nice to have _additional_ feature. But to me that was never the requirement of determinism. Determinism was: exact same inputs produce same output. Different inputs _can_ produce the same output, but also will often product different output.

The feature of "different input is guaranteed to have the same output if there are no functional differences" is new, and is not something I expected or felt was necessary.

Note: it's also not an actual requirement of tools like domino or 1ES afaict.

CyrusNajmabadi on 11 Dec 2015

But when they see no changes to the IL yet different DLLs they will correctly push back that we have a bug.

They can definitely push back. But a simply explanation of "you are only guaranteed to get the same output when the input is identical" is a reasonable and acceptable response.

Note: at least when I left Google, this was how the system worked. Inputs were simply hashed, with no regard for if they could otherwise be considered semantically identical. It did not raise the ire of anyone there.

CyrusNajmabadi on 11 Dec 2015

@jaredpar What about changing unused constants or code that gets optimized away?

TyOverby on 14 Dec 2015

We can change the PDB format to stop including a checksum of the source files. If we do that then minor whitespace changes will not affect the PDB. But currently it is part of the PDB spec that such checksums are to be included in the PDB. Consequently the PDB necessarily depends on whitespace changes. We use the PDB ID (which appears in the assembly) to identify a particular PDB. Using the same PDB ID for distinct PDBs that differ would, therefore, undermine the mechanism.

If you want the checksum to be excluded (from PDBs), please file a separate feature request, and we can begin negotiating with the debugger teams.

As it stands, things are behaving by design. This is already documented at https://github.com/dotnet/roslyn/blob/master/docs/compilers/Deterministic%20Inputs.md