It'd be really nice if there were an easy way to break down compilation time into a few categories, and to get a summary of reuse:
Something that we could conveniently copy and paste into bug reports to get an overall sense of what is going on. This would replace and/or build on the existing profiling options. (Even better would be if we could then have an option to get per-query data later on.)
cc @nnethercote @michaelwoerister
Excluding the reuse part, isn't this what -Ztime-passes is for?
I've been told that query profiling is meant to replace -Ztime-passes, but when I followed the instructions at https://forge.rust-lang.org/profile-queries.html I couldn't get it to work -- it produced empty files.
Here are some ways that I think -Ztime-passes could be improved:
A lot of improvements could be made by recording data and then having a post-processing step, which could aggregate insignificant buckets, for example. I don't know whether that post-processing should happen within rustc itself or be a separate tool.
@nnethercote I think that time-passes and the profile-queries concepts probably ought to be merged, and renamed to something like -Zself-profile. One problem with time-passes is that the compiler is moving towards a model where it doesn't have passes, but we're not there yet. This means that the implementation model of time-passes is pretty broken: it is based around some top-level calls, but in fact those calls can unreliably include a bunch of unrelated work. For example, invoking the MIR borrow checker may also construct MIR, but it's probably not a good idea to account those the same. This is what profile-queries was meant to fix, but as you observed, the impl is broken. I think we should probably fix profile-queries and simplify it. For example, it doesn't seem like we've really had much need for the HTML-based output it produces; we'd rather just get a flat list of which queries are taking up time. (And yes, it'd be nice to group those queries, because often we don't need that much detail.)
I think my preference would be that when we declare queries, we also declare a "phase" or something that is supposed to be used for a coarse-grained dump (and then you could use -Zfull-profile or something if you want the full details).
I'd like to work on this.
It would be great if we had a solution that can deal well with parallelism, potentially also showing points of contention. We have the (possibly bitrotted) -Ztrans_time_graph option, which generates a diagram that is rather crude but was indispensable for implementing the scheduler correctly. There was also some discussion about generalizing this and using a known output format that can be visualized in existing tools here: https://github.com/rust-lang/rust/issues/43804
That being said, I don't think we should try to fix any of the existing solutions; we should build a new one. It should support at least a pure text summary format that can easily be collected by users and pasted into bug reports. An additional, more detailed output format that can be easily processed by other tools would be great too.
@nnethercote, are there any lessons to be learned from FF's profiling support? Or the valgrind-based tools?
Also cc the whole @rust-lang/wg-compiler-performance
Apologies for the slow reply.
Text-based output is indeed extremely convenient for pasting into bugs, etc., so I recommend it whenever it's appropriate. The simplest thing to do is just dump out the final text directly from rustc, as -Ztime-passes currently does, but that's inflexible.
The alternative is to write data to a file and then have a separate viewing tool. If the data has a tree-like structure (as -Ztime-passes already does) then the ability to expand/collapse subtrees is very useful. Firefox's about:memory is a good example of this. Here's some sample output:
397.53 MB (100.0%) -- explicit
├──108.74 MB (27.35%) ── heap-unclassified
├───63.97 MB (16.09%) -- js-non-window
│   ├──39.85 MB (10.02%) -- zones
│   │   ├──35.94 MB (09.04%) -- zone(0x114a80000)
│   │   │   ├──18.46 MB (04.64%) ++ (36 tiny)
│   │   │   └──17.48 MB (04.40%) -- realm([System Principal], shared JSM global)
│   │   │       ├───8.81 MB (02.22%) ++ classes
│   │   │       └───8.67 MB (02.18%) ++ (11 tiny)
It's all text so it can be copied and pasted easily. (The lines are Unicode light box-drawing chars U+2500, U+2502, U+2514, U+251C, which usually render nicely in terminals and text boxes.) But it's also interactive: when viewing it within Firefox you can click on any of those lines that have "++" or "--" to expand or collapse a sub-tree. Also, by default it only shows as expanded the sub-trees that exceed a certain threshold; this is good because you can collect arbitrarily detailed data (i.e. as many buckets, sub-buckets, sub-sub-buckets, etc., as you like) without it being overwhelming.
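The threshold behaviour described above can be sketched in a few lines: buckets below some fraction of the total get folded into a single "(N tiny)" entry, the way about:memory does. A minimal sketch; the bucket names and the 2% cutoff below are invented for illustration:

```rust
// Fold buckets below `threshold` (a fraction of the total) into one
// "(N tiny)" entry. Illustrative sketch, not Firefox's actual code.
fn collapse_tiny(buckets: &[(&str, u64)], threshold: f64) -> Vec<(String, u64)> {
    let total: u64 = buckets.iter().map(|&(_, n)| n).sum();
    let mut out = Vec::new();
    let (mut tiny_count, mut tiny_sum) = (0u64, 0u64);
    for &(name, n) in buckets {
        if n as f64 / total as f64 >= threshold {
            out.push((name.to_string(), n));
        } else {
            tiny_count += 1;
            tiny_sum += n;
        }
    }
    if tiny_count > 0 {
        out.push((format!("({tiny_count} tiny)"), tiny_sum));
    }
    out
}

fn main() {
    // Hypothetical buckets; only "classes" clears the 2% threshold.
    let buckets = [("classes", 8_810_000u64), ("scripts", 120_000), ("shapes", 90_000)];
    for (name, bytes) in collapse_tiny(&buckets, 0.02) {
        println!("{bytes:>9} B  {name}");
    }
}
```

The nice property is that the collapsing happens at display time, so the collected data can stay arbitrarily detailed.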
I would recommend using a web browser as your display tool. One possibility is to generate an HTML page (or pages) directly from rustc, as profile-queries does. Another possibility is to generate some other intermediate format (JSON?) and then write a webpage/website that reads it in, as rustc-perf's site does. The latter approach might be more flexible.
I've got a very basic version of this here. Here's a sample of the output:
wesley@endurance:~/code/rust/rustc-serialize> RUSTFLAGS="-Zself-profile" cargo build
Compiling rustc-serialize v0.3.24 (file:///home/wesley/code/rust/rustc-serialize)
Self profiling results for rustc_serialize:
Parsing 27
Expansion 132
TypeChecking 2199
    1597268 hits 1666449 queries
BorrowChecking 150
    10064 hits 14436 queries
Codegen 2098
    44011 hits 54580 queries
Linking 17
    4724 hits 5660 queries
Other 601
    2389409 hits 2461645 queries
Optimization level: No
Incremental: on
Finished dev [unoptimized + debuginfo] target(s) in 5.36 secs
It might be nice to have a category for "incremental overhead" such as hashing & serializing/deserializing the graph? Perhaps Parsing could also output the number of lines parsed to give a sense for the size of the crate? Are there any other rustc options that would be useful to include in the output?
That looks very cool! Some feedback:
- Are the hits cache hits? It might be more readable to display a ratio instead of absolute numbers.
- It would be good to state the unit for the timings (ms?).
- Some rustc options take the --verbose flag into account. This could be used here to control how much data this provides.
- -Zincremental-info could also be shown here (like CGU re-use, red vs green nodes, etc).

This is awesome work @wesleywiser! 💯
I agree with @michaelwoerister's suggestion that we include ms or some other unit. I also think it'd be nice to display the data more "tabularly". Something like this:
| Phase | Time (ms) | Queries | Hits (%) |
| ----- | --------- | ------- | -------- |
| Parsing | 27 | | |
| Expansion | 132 | | |
| TypeChecking | 2199 | 1666449 | 95.8 |
| BorrowChecking | 150 | 14436 | 69.7 |
| Codegen | 2098 | 54580 | 80.6 |
| Linking | 17 | 5660 | 83.5 |
| Other | 601 | 2461645 | 97.1 |
I've gone ahead and made it valid Markdown, too, so that it can easily be pasted for nice display =)
```
| Phase | Time (ms) | Queries | Hits (%) |
| ----- | --------- | ------- | -------- |
| Parsing | 27 | | |
| Expansion | 132 | | |
| TypeChecking | 2199 | 1666449 | 95.8 |
| BorrowChecking | 150 | 14436 | 69.7 |
| Codegen | 2098 | 54580 | 80.6 |
| Linking | 17 | 5660 | 83.5 |
| Other | 601 | 2461645 | 97.1 |
```
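Incidentally, the hit percentages in the table can be reproduced from the raw counts in the sample -Zself-profile output. A quick sketch using those numbers:

```rust
fn main() {
    // (phase, cache hits, total queries) from the sample run above
    let phases = [
        ("TypeChecking", 1_597_268u64, 1_666_449u64),
        ("BorrowChecking", 10_064, 14_436),
        ("Codegen", 44_011, 54_580),
        ("Linking", 4_724, 5_660),
        ("Other", 2_389_409, 2_461_645),
    ];
    for (name, hits, queries) in phases {
        // e.g. TypeChecking: 1597268 / 1666449 ≈ 95.8%
        println!("{name}: {:.1}%", hits as f64 * 100.0 / queries as f64);
    }
}
```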
Finally, I think we should display the time as a percentage of the total for each phase.
As a last point, I think it'd be great if, to determine the total time, we did a single measurement from the "outside". This will help us ensure we don't have "leakage", that is, time that is not accounted for. Or at least it will give us an idea of how much leakage there is.
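A minimal sketch of that outside measurement (not the actual rustc implementation; run_compilation is a placeholder and the phase durations are the ones from the sample run):

```rust
use std::time::{Duration, Instant};

fn run_compilation() {} // stand-in for the real work

fn main() {
    // One wall-clock measurement taken from the outside...
    let start = Instant::now();
    run_compilation();
    let total = start.elapsed();

    // ...compared against the sum of the per-phase timings.
    let phases: &[(&str, Duration)] = &[
        ("Parsing", Duration::from_millis(27)),
        ("Expansion", Duration::from_millis(132)),
        ("TypeChecking", Duration::from_millis(2182)),
    ];
    let accounted: Duration = phases.iter().map(|&(_, d)| d).sum();

    // Whatever is left over is "leakage": time no phase accounted for.
    let leakage = total.checked_sub(accounted).unwrap_or_default();
    println!("total: {total:?}, accounted: {accounted:?}, leakage: {leakage:?}");
}
```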
Thanks for the feedback @nikomatsakis and @michaelwoerister! Those are all great ideas and I'll start working on those.
So it looks like the output is being produced directly by rustc, is that right?
@nnethercote Yes, that's correct.
Progress update: I've implemented much of the above feedback. The output now looks like this:
wesley@endurance:~/code/rust/rustc-serialize> RUSTFLAGS="-Zself-profile -Zprofile-json" cargo build
Compiling rustc-serialize v0.3.24 (file:///home/wesley/code/rust/rustc-serialize)
Self profiling results for rustc_serialize:
| Phase | Time (ms) | Queries | Hits (%) |
| ---------------- | -------------- | -------------- | -------- |
| Parsing | 27 | | |
| Expansion | 132 | | |
| TypeChecking | 2182 | 3321830 | 96.17 |
| BorrowChecking | 144 | 28872 | 69.71 |
| Codegen | 2065 | 110571 | 79.61 |
| Linking | 17 | 11320 | 83.46 |
| Other | 590 | 4922566 | 97.07 |
Optimization level: No
Incremental: on
Finished dev [unoptimized + debuginfo] target(s) in 5.29 secs
In addition, -Zprofile-json causes a file containing the timing data to be created:
wesley@endurance:~/code/rust/rustc-serialize> cat self_profiler_results.json
[
  {
    "category": "Parsing",
    "time_ms": 27
  },
  {
    "category": "Expansion",
    "time_ms": 132
  },
  {
    "category": "TypeChecking",
    "time_ms": 2182
  },
  {
    "category": "BorrowChecking",
    "time_ms": 144
  },
  {
    "category": "Codegen",
    "time_ms": 2065
  },
  {
    "category": "Linking",
    "time_ms": 17
  },
  {
    "category": "Other",
    "time_ms": 590
  }
]
One suggestion: the JSON file should contain enough data to exactly recreate the -Zself-profile output. Currently it appears to lack the information to recreate:
- the query counts and hit percentages
- the optimization level and incremental setting
If it's possible to get the full command line that rustc was invoked with, I suggest including that in the JSON output too. That way if you end up with a bunch of different JSON files you have full data on how each one was created.
Finally, whenever you have coarse categories like this, the natural inclination is for users to want more detail. "Codegen takes 2065ms... can we break that down some more?" (I experienced this back in the early days of Firefox's about:memory.) This naturally takes us back to the idea of a tree-like structure, where you have categories, sub-categories, sub-sub-categories, etc.
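That tree-shaped breakdown could look something like this (a sketch; the sub-category names and numbers are illustrative, not real rustc data):

```rust
// Each category can carry sub-categories; each line shows its share
// of the grand total.
struct Node {
    name: &'static str,
    ms: u64,
    children: Vec<Node>,
}

fn print_tree(node: &Node, total: u64, depth: usize) {
    let pct = node.ms as f64 * 100.0 / total as f64;
    println!("{:indent$}{} {} ms ({pct:.1}%)", "", node.name, node.ms, indent = depth * 2);
    for child in &node.children {
        print_tree(child, total, depth + 1);
    }
}

fn main() {
    let codegen = Node {
        name: "Codegen",
        ms: 2065,
        children: vec![
            Node { name: "LLVM passes", ms: 1200, children: vec![] }, // made-up split
            Node { name: "(other)", ms: 865, children: vec![] },
        ],
    };
    print_tree(&codegen, 2065, 0);
}
```

Combined with a display threshold like about:memory's, the detail stays available without overwhelming the default view.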
A while ago I did some experiments with replacing time-passes with Chrome tracing events. I have updated my branch of rustc now and have some results posted on this page.
One downside is that the generated pages only work in Chrome.
For some of the events I attach extra data describing the event, so it is possible to see why some parts take more time.
Maybe something similar could be used for the verbose mode.
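For reference, Chrome's Trace Event Format is just a JSON array of events; "complete" events use ph = "X" with timestamps and durations in microseconds, and the result loads in chrome://tracing (Perfetto also reads it). A sketch of emitting the sample phases that way, pretending they ran back to back:

```rust
fn main() {
    let phases = [("Parsing", 27u64), ("Expansion", 132)]; // (name, ms)
    let mut ts = 0u64; // event start time in microseconds
    let mut events = Vec::new();
    for (name, ms) in phases {
        let dur = ms * 1000; // ms -> us
        // One "complete" event per phase (ph = "X").
        events.push(format!(
            r#"{{"name":"{name}","ph":"X","ts":{ts},"dur":{dur},"pid":0,"tid":0}}"#
        ));
        ts += dur;
    }
    println!("[{}]", events.join(","));
}
```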
@wesleywiser this looks really nice.
I believe this can be closed now. We have the -Zself-profile option, tools to process that data, and documentation on how to do so.