Rust-analyzer: Compress rarely modified files

Created on 21 Feb 2019 · 24 Comments · Source: rust-analyzer/rust-analyzer

Crazy idea: source code occupies a non-negligible amount of memory. For rust-analyzer, it is

4222 (47mb) files

which actually is worse than I expected (could this be a bug? Do we include unrelated files?).

It might be a good idea to compress this code on the fly! Specifically, we can store text not as Arc<String>, but as an opaque TextBuffer object, which can compress and decompress large texts on the fly. For example, we could compress all files after the initial indexing of the project, and decompress them on demand.

This shouldn't be too hard to implement, actually!

To clarify, I still think it's a good idea to keep all the source code in memory, to avoid IO errors, but we could use less memory.

E-medium fun

matklad
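
A minimal sketch of the TextBuffer idea described above, not code from the thread or from rust-analyzer. The variant names and methods are hypothetical, and the run-length codec is only a std-only stand-in for a real compressor such as LZ4.

```rust
use std::sync::Arc;

/// Toy run-length codec, a stand-in for a real compressor.
/// Encodes the input as (count, byte) pairs with count in 1..=255.
fn toy_compress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        while run < 255 && i + run < input.len() && input[i + run] == byte {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

/// Inverse of `toy_compress`: expands each (count, byte) pair.
fn toy_decompress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in input.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    out
}

/// Hypothetical opaque buffer: callers only ever ask for `text()`.
pub enum TextBuffer {
    Plain(Arc<String>),
    Compressed(Vec<u8>),
}

impl TextBuffer {
    pub fn new(text: String) -> TextBuffer {
        TextBuffer::Plain(Arc::new(text))
    }

    /// Switch to the compressed representation, e.g. after initial indexing.
    pub fn compress(&mut self) {
        if let TextBuffer::Plain(text) = self {
            let bytes = toy_compress(text.as_bytes());
            *self = TextBuffer::Compressed(bytes);
        }
    }

    /// Return the text, decompressing on demand if needed.
    pub fn text(&self) -> Arc<String> {
        match self {
            TextBuffer::Plain(text) => Arc::clone(text),
            TextBuffer::Compressed(bytes) => {
                let raw = toy_decompress(bytes);
                Arc::new(String::from_utf8(raw).expect("codec round-trips valid UTF-8"))
            }
        }
    }

    pub fn is_compressed(&self) -> bool {
        matches!(self, TextBuffer::Compressed(_))
    }
}
```

Because everything stays in memory, reading a compressed buffer cannot fail with an IO error; the cost is CPU time on each decompression.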

All 24 comments

I wonder if it would be possible to have some sort of LRU caching for the compressed source, where you compress everything, but things that are being frequently changed, may stay in memory uncompressed to avoid unnecessary compression/decompression. I guess it also depends on the compression itself, what kind of overhead it has.

vipentti on 21 Feb 2019

👍1

Why is the source stored at all? Can't it be read from disk as needed?

jrmuizel on 21 Feb 2019

@jrmuizel it's important to let no arbitrary IO into the core incremental computation. We can't guarantee that reading a file twice will yield the same result, and, if we get different results in the same incremental session, we'll be in an inconsistent state.

What should be possible is to "copy" files to some ".rust-analyzer" dir and read them from there, with a contract that an IO error while reading from this private to rust-analyzer dir is fatal and requires restart.

Overall, spending 50 megs of RAM to store text seems a much better deal than dealing with IO in any form. A good thing about compression is that it gives us memory savings in a purely functional context.

matklad on 21 Feb 2019

> I wonder if it would be possible to have some sort of LRU

The simplest form of LRU is "compress everything once in a while". This is what we do for syntax trees, and it seems to work.

matklad on 21 Feb 2019
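
A rough sketch of the "compress everything once in a while" policy from the comment above, combined with the earlier wish that recently used files stay uncompressed. All names here are hypothetical, and the `compressed` flag merely stands in for holding real compressed bytes.

```rust
use std::collections::HashMap;

/// Hypothetical per-file entry.
struct Entry {
    text: String,
    compressed: bool, // stand-in flag; a real store would hold compressed bytes
    touched: bool,    // set on access, cleared by each sweep
}

pub struct SourceStore {
    files: HashMap<String, Entry>,
}

impl SourceStore {
    pub fn new() -> Self {
        SourceStore { files: HashMap::new() }
    }

    pub fn insert(&mut self, path: &str, text: &str) {
        let entry = Entry { text: text.to_string(), compressed: false, touched: false };
        self.files.insert(path.to_string(), entry);
    }

    /// Reading a file "decompresses" it and marks it as recently used.
    pub fn read(&mut self, path: &str) -> Option<&str> {
        let entry = self.files.get_mut(path)?;
        entry.compressed = false;
        entry.touched = true;
        Some(entry.text.as_str())
    }

    /// Periodic sweep: compress every file that was not read since the
    /// previous sweep. Returns how many files were newly compressed.
    pub fn sweep(&mut self) -> usize {
        let mut newly_compressed = 0;
        for entry in self.files.values_mut() {
            if !entry.touched && !entry.compressed {
                entry.compressed = true;
                newly_compressed += 1;
            }
            entry.touched = false;
        }
        newly_compressed
    }

    pub fn is_compressed(&self, path: &str) -> Option<bool> {
        self.files.get(path).map(|entry| entry.compressed)
    }
}
```

With this policy a file that is read between two sweeps survives one sweep uncompressed and is compressed by the next one if nobody touches it again.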

How does ra_vfs relate to the compression? ra_vfs monitors the files and keeps them in memory, right? So should the compression already happen in ra_vfs?

vipentti on 21 Feb 2019

Yeah, I think so!

Currently, Vfs stores text as text: Arc<String>, and it could be changed to a more abstract type. I can't sketch the whole design off the top of my head; I expect there will be some interesting questions about lifetimes and interior mutability. If we are introducing a new type for this, it might also be a good occasion to switch the LSP layer to patching files with edits on modification, instead of asking the client for the whole text buffer every time.

matklad on 21 Feb 2019
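
To illustrate the "patch files with edits" idea from the comment above, here is a small sketch that applies one edit to an in-memory buffer. `TextEdit` and `apply_edit` are hypothetical names, and the edit uses byte offsets as a simplifying assumption; LSP itself describes ranges as line/character positions, which would need converting first.

```rust
use std::ops::Range;

/// Hypothetical edit: replace the byte range `range` with `new_text`.
pub struct TextEdit {
    pub range: Range<usize>,
    pub new_text: String,
}

/// Apply one edit in place. Returns false and leaves `text` untouched
/// if the range is out of bounds or splits a UTF-8 character.
pub fn apply_edit(text: &mut String, edit: &TextEdit) -> bool {
    let (start, end) = (edit.range.start, edit.range.end);
    let valid = start <= end
        && end <= text.len()
        && text.is_char_boundary(start)
        && text.is_char_boundary(end);
    if valid {
        text.replace_range(start..end, &edit.new_text);
    }
    valid
}
```

Validating the range up front matters because `String::replace_range` panics on an invalid range, and a language server should not crash on a malformed client edit.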

Hi! I've noticed that a lot of non-Rust files (LICENSE, AUTHORS, Dockerfile, COPYING, .gitignore, etc.) are included in the salsa db. Is this by design?

marcogroppo on 5 Mar 2019

@marcogroppo that's definitely a bug, only .rs files should be included

matklad on 5 Mar 2019

found it:

https://github.com/rust-analyzer/ra_vfs/blob/beac2769f48474a7dc33014a982614c5c13804ea/src/roots.rs#L97-L100

Here, we include extension-less files. This is so that we don't ignore directories. We should probably do additional filtering somewhere in the IO layer to filter out extension-less files.

matklad on 5 Mar 2019

👍1
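
A possible shape for the additional filtering mentioned above, as a sketch rather than the actual ra_vfs patch. The function name and the `is_dir` parameter are assumptions; the caller is expected to supply `is_dir` from the directory walk.

```rust
use std::path::Path;

/// Directories always pass so that traversal can descend into them;
/// regular files pass only if their extension is exactly `rs`.
pub fn should_include(path: &Path, is_dir: bool) -> bool {
    if is_dir {
        return true;
    }
    path.extension().and_then(|ext| ext.to_str()) == Some("rs")
}
```

Files such as LICENSE or .gitignore have no extension according to `Path::extension`, so they are rejected without a special case.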

I did a quick check and with the ra_vfs patch the memory occupied by rust-analyzer's source code is now 3586 (38mb) files (without the patch: 4305 (47mb) files). Another thing I've noticed is that the source code includes tests, benchmarks and examples from libcore and other dependencies.

marcogroppo on 6 Mar 2019

👍1

This seems like an interesting idea but one should note that some operating systems already compress memory pages when under pressure (macOS by default, Linux with zram).

killercup on 6 Mar 2019

❤1

I think once we can properly ignore files that are not necessary, like tests, benchmarks or examples from external sources, the amount of files should be reduced even further.

vipentti on 7 Mar 2019

> I think once we can properly ignore files that are not necessary, like tests, benchmarks or examples from external sources, the amount of files should be reduced even further.

I think we should extend the vfs API to allow specifying exclusions together with the roots. Then, we can change the logic in rust-analyzer to ignore tests|benches|examples for crates from crates.io.

matklad on 7 Mar 2019

Could we use the ignore crate for this?

kjeremy on 7 Mar 2019

I think we can use ignore to at least get .gitignore support; maybe we could use it to ignore other things as well?

vipentti on 7 Mar 2019

Yeah, using gitignore is fine!

We only need to think carefully about the interface between VFS and the rest of the world, such that consumers could flexibly choose the strategy. Perhaps VFS should just accept a BoxFn, such that using gitignore is strictly consumer’s business?

matklad on 7 Mar 2019
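
One way the "VFS accepts a boxed filter" interface discussed above could look. This is a sketch under assumptions: `Filter`, `RootConfig` and `dependency_filter` are hypothetical names, not the ra_vfs API. The example policy skips tests, benches and examples directories at the top of a dependency's root.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical filter type: returns true if the path should be loaded.
pub type Filter = Box<dyn Fn(&Path) -> bool + Send + Sync>;

/// Hypothetical root description: a directory plus a consumer-chosen filter.
pub struct RootConfig {
    pub root: PathBuf,
    pub filter: Filter,
}

impl RootConfig {
    /// The VFS side only checks containment and then defers to the filter.
    pub fn accepts(&self, path: &Path) -> bool {
        path.starts_with(&self.root) && (self.filter)(path)
    }
}

/// Example consumer policy for a crates.io dependency: reject anything under
/// a top-level `tests`, `benches` or `examples` directory.
pub fn dependency_filter(root: PathBuf) -> Filter {
    Box::new(move |path: &Path| {
        let rel = match path.strip_prefix(&root) {
            Ok(rel) => rel,
            Err(_) => return false,
        };
        let excluded = rel.components().next().map_or(false, |first| {
            matches!(
                first.as_os_str().to_str(),
                Some("tests" | "benches" | "examples")
            )
        });
        !excluded
    })
}
```

Keeping the policy in a closure means a gitignore-based filter or a "keep everything for the current project" filter can be swapped in by the consumer without changing the VFS.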

Wouldn't it be good to include the examples, tests and benchmarks, so things like go to definition and find references keep working?

lnicola on 7 Mar 2019

@lnicola for crates.io dependencies I think that is not important

matklad on 7 Mar 2019

Good point. But for the current project they are.

lnicola on 7 Mar 2019

Something we've discussed with @Xanewok on Zulip is that we can also fold parsing into the mix and have a three-state repr:

enum SourceState {
    Compressed(Vec<u8>),
    Decompressed(String),
    Parsed(TreeArc<ast::SourceFile>),
}

The repr could change dynamically (so interior mutability is required) depending on access patterns and memory usage. This should also allow us to incrementally reparse files.

matklad on 8 Apr 2019
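
A sketch of how the three-state repr above might switch states behind a shared reference. Everything beyond the enum's shape is an assumption: `ParsedFile` stands in for `TreeArc<ast::SourceFile>`, the "parser" just counts `fn ` occurrences, and `compress`/`decompress` are identity placeholders for a real codec.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for a syntax tree: keeps the text plus a toy "parse result".
pub struct ParsedFile {
    pub text: String,
    pub fn_count: usize,
}

enum SourceState {
    Compressed(Vec<u8>),
    Decompressed(String),
    Parsed(Arc<ParsedFile>),
}

// Placeholder codec: copies bytes unchanged. A real one would call a compressor.
fn compress(text: &str) -> Vec<u8> {
    text.as_bytes().to_vec()
}

fn decompress(bytes: &[u8]) -> String {
    String::from_utf8(bytes.to_vec()).expect("placeholder codec preserves UTF-8")
}

fn toy_parse(text: String) -> ParsedFile {
    let fn_count = text.matches("fn ").count();
    ParsedFile { text, fn_count }
}

/// The Mutex provides the interior mutability: all methods take `&self`.
pub struct SourceFile {
    state: Mutex<SourceState>,
}

impl SourceFile {
    pub fn new(text: &str) -> Self {
        SourceFile { state: Mutex::new(SourceState::Decompressed(text.to_string())) }
    }

    /// Returns the text; a compressed file becomes Decompressed as a side effect.
    pub fn text(&self) -> String {
        let mut state = self.state.lock().unwrap();
        if let SourceState::Compressed(bytes) = &*state {
            let text = decompress(bytes);
            *state = SourceState::Decompressed(text);
        }
        match &*state {
            SourceState::Decompressed(text) => text.clone(),
            SourceState::Parsed(parsed) => parsed.text.clone(),
            SourceState::Compressed(_) => unreachable!("decompressed above"),
        }
    }

    /// Returns the parse result, moving the file into the Parsed state.
    pub fn parsed(&self) -> Arc<ParsedFile> {
        let mut state = self.state.lock().unwrap();
        let text = match &*state {
            SourceState::Parsed(parsed) => return Arc::clone(parsed),
            SourceState::Decompressed(text) => text.clone(),
            SourceState::Compressed(bytes) => decompress(bytes),
        };
        let parsed = Arc::new(toy_parse(text));
        *state = SourceState::Parsed(Arc::clone(&parsed));
        parsed
    }

    /// Called under memory pressure: drop back to the Compressed state.
    pub fn compact(&self) {
        let mut state = self.state.lock().unwrap();
        let bytes = match &*state {
            SourceState::Compressed(_) => return,
            SourceState::Decompressed(text) => compress(text),
            SourceState::Parsed(parsed) => compress(&parsed.text),
        };
        *state = SourceState::Compressed(bytes);
    }

    pub fn state_name(&self) -> &'static str {
        match &*self.state.lock().unwrap() {
            SourceState::Compressed(_) => "compressed",
            SourceState::Decompressed(_) => "decompressed",
            SourceState::Parsed(_) => "parsed",
        }
    }
}
```

A `Mutex` is used here so the type can be shared across threads; in a single-threaded setting a `RefCell` would serve the same purpose.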

(Another) crazy idea: Store source code and other large (meta)data in a sqlite or similar database-in-a-file system.
This would allow us to reduce memory usage while also having minimal effects on performance.

The new dependencies are not insignificant, but they would probably be acceptable.
This approach also scales better for huge projects.

spadaval on 3 Jan 2020

@spadaval this might work at the salsa level, see https://github.com/salsa-rs/salsa/issues/10.

lnicola on 3 Jan 2020

I gave this a try at the VFS level, using LZ4:

| Run | RSS (MB) | CPU time (s) | Mem (MB) |
| ------- | -------------: | -----------------: | --------------: |
| before | 812 | 20.24 | 764 |
| before | 812 | 18.69 | 765 |
| before | 814 | 20.74 | 764 |
| after | 841 | 19.76 | 751 |
| after | 842 | 21.58 | 751 |
| after | 813 | 19.54 | 751 |

Uncompressed source code is 43 MB. The test consisted of starting Code with only RA's main.rs open. I didn't use the custom dictionary feature of the LZ4 crate, but that might help a little, too.

Overall I'm not convinced this is worth it, what do you think?

lnicola on 19 Sep 2020

👍1

Yeah, seems like it's not worth it at this time!

Thanks for quantifying the wins here @lnicola, that's super helpful!

matklad on 21 Sep 2020
