Please read this comment for a summary of changes and structured information about each of those changes. This might be a better starting point than reading through all of the comments in this issue.
As stated in #1988, the index is one of the biggest contributors to restic's memory consumption. The main culprit is the current implementation of the in-memory index storage in internal/repository/index.go.
I would therefore like to start a new issue to discuss strategies that could remedy the high memory consumption.
As a starting point I opened a branch index-alternatives which implements the following strategies:
Index strategy can be chosen by
restic -i <default|reload|low-mem|bolt>
The branch is considered WIP. I added a general interface FileIndex which should be implemented by new index strategies. Also, the current implementation just loads index files (file by file) into the standard Index struct and then reloads them into a new data structure, allowing the GC to clean up. Newly created index files are not affected, so there should only be an effect for repositories that already contain data.
Please note that most implementations are quick-and-dirty and miss serious error handling, clean-up etc.
More index strategies can easily be added by providing a new implementation of FileIndex and registering it in repository.go (where loading takes place) and cmd/restic/global.go (where the flag -i is interpreted).
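For illustration, here is a minimal sketch of what such a pluggable index interface could look like; all names, types and signatures below are assumptions for discussion, not the actual FileIndex interface from the branch:

```go
package index

import "context"

// Hypothetical sketch of a pluggable index interface as described above.
// Names and signatures are illustrative assumptions, not the code from the
// index-alternatives branch.

// BlobHandle identifies a blob by its ID and type (simplified stand-in).
type BlobHandle struct {
	ID   [32]byte
	Type uint8
}

// BlobLocation describes where a blob is stored inside a pack file.
type BlobLocation struct {
	PackID [32]byte
	Offset uint32
	Length uint32
}

// FileIndex is the minimal lookup interface an index strategy would have to
// implement so it can be selected via a flag like `-i low-mem`.
type FileIndex interface {
	// Lookup returns the known location(s) of a blob.
	Lookup(h BlobHandle) ([]BlobLocation, bool)
	// Has reports whether the blob is present in the index.
	Has(h BlobHandle) bool
	// Each iterates over all entries, e.g. for statistics or rebuilds.
	Each(ctx context.Context, fn func(BlobHandle, BlobLocation)) error
}
```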
I would appreciate feedback on the proposed index strategies.
In this issue I would also like to collect good test settings in which different index strategies can be compared to each other.
I'd like to start with the setup I used for testing during the implementation:
I created test data with the following script
#!/bin/sh
mkdir data
for i in `seq 1 1000`; do
    echo $i
    mkdir data/$i
    for j in `seq 1 100`; do
        echo $i$j > data/$i/$j
    done
done
This creates 100,000 small, distinct files in 1,000 directories. It should therefore produce a lot of data blobs and still quite a few tree blobs. The script can be changed to simulate even more files.
I ran the initial backup
restic -r /path/to/repo init
restic -r /path/to/repo backup /path/to/data
and did a repeated backup run with unchanged data for all index strategies:
/usr/bin/time restic -r /path/to/repo -i <strategy> backup /path/to/data
Besides measuring with time, I also added memory profiling.
Here are the results:
Files: 0 new, 0 changed, 100000 unmodified
Dirs: 0 new, 0 changed, 2 unmodified
Added to the repo: 0 B
processed 100000 files, 567.676 KiB in 0:08
snapshot d7a3f88b saved
13.14user 1.71system 0:11.16elapsed 133%CPU (0avgtext+0avgdata 90564maxresident)k
0inputs+64outputs (0major+1756minor)pagefaults 0swaps
(pprof) top
Showing nodes accounting for 15.21MB, 97.56% of 15.59MB total
Dropped 84 nodes (cum <= 0.08MB)
Showing top 10 nodes out of 59
flat flat% sum% cum cum%
12.52MB 80.26% 80.26% 12.52MB 80.26% github.com/restic/restic/internal/repository.(*Index).store
1.01MB 6.46% 86.72% 1.01MB 6.50% github.com/restic/chunker.NewWithBoundaries
0.39MB 2.52% 89.23% 0.39MB 2.52% reflect.New
0.36MB 2.29% 91.53% 0.37MB 2.34% github.com/restic/restic/internal/restic.NodeFromFileInfo
0.29MB 1.89% 93.41% 1.10MB 7.05% github.com/restic/restic/internal/archiver.(*Archiver).SaveDir
0.28MB 1.80% 95.22% 0.28MB 1.80% bytes.makeSlice
0.15MB 0.94% 96.15% 0.15MB 0.94% github.com/restic/restic/internal/archiver.(*TreeSaver).Save
0.09MB 0.6% 96.76% 0.09MB 0.6% strings.(*Builder).grow
0.08MB 0.5% 97.26% 0.08MB 0.5% github.com/restic/restic/internal/restic.BlobSet.Insert
0.05MB 0.3% 97.56% 0.38MB 2.46% encoding/json.Marshal
Here, the index is already the largest part of the memory usage. You can also see that time reports 90 MB used; this is IMO due to the garbage collector setting that has been discussed already.
This seems to work. However, it displayed an ETA of about 20 minutes while constantly using 100% CPU... This is easily explained: it constantly rereads (and decrypts!) the index files.
Hence, I consider this index strategy not worth a deeper look.
<snip: same output as above>
processed 100000 files, 567.676 KiB in 0:08
snapshot f8822cee saved
13.47user 2.03system 0:11.90elapsed 130%CPU (0avgtext+0avgdata 109584maxresident)k
0inputs+72outputs (0major+1945minor)pagefaults 0swaps
(pprof) top
Showing nodes accounting for 7.68MB, 94.71% of 8.11MB total
Dropped 88 nodes (cum <= 0.04MB)
Showing top 10 nodes out of 63
flat flat% sum% cum cum%
4.84MB 59.68% 59.68% 4.84MB 59.68% github.com/restic/restic/internal/restic.IDSet.Insert
1.01MB 12.50% 72.18% 1.02MB 12.58% github.com/restic/chunker.NewWithBoundaries
0.41MB 5.05% 77.23% 0.41MB 5.05% reflect.New
0.35MB 4.26% 81.48% 0.35MB 4.26% github.com/restic/restic/internal/restic.NodeFromFileInfo
0.28MB 3.47% 84.95% 0.28MB 3.47% bytes.makeSlice
0.27MB 3.31% 88.26% 1.01MB 12.46% github.com/restic/restic/internal/archiver.(*Archiver).SaveDir
0.19MB 2.33% 90.60% 0.19MB 2.33% github.com/restic/restic/internal/archiver.(*TreeSaver).Save
0.18MB 2.22% 92.81% 4.94MB 60.93% github.com/restic/restic/internal/repository.(*IndexLowMem).store
0.08MB 0.96% 93.78% 0.08MB 0.96% github.com/restic/restic/internal/restic.BlobSet.Insert
0.08MB 0.93% 94.71% 0.40MB 4.92% encoding/json.Marshal
Total memory reported by the profiler is halved! Moreover, the 12.5 MB Index data set is replaced by a less than 5 MB IDSet. I would estimate that with this approach the index memory requirement can be reduced to about 40%!
The trade-off is a slightly higher total backup time (13.5s as compared to 13.1s with the standard index).
I'm really interested in how this index strategy performs in other test settings!
<snip: same output as above>
processed 100000 files, 567.676 KiB in 0:30
snapshot 9ff3d71f saved
37.01user 3.25system 0:34.54elapsed 116%CPU (0avgtext+0avgdata 282816maxresident)k
0inputs+114096outputs (0major+5436minor)pagefaults 0swaps
(pprof) top
Showing nodes accounting for 2592.31kB, 90.31% of 2870.50kB total
Dropped 79 nodes (cum <= 14.35kB)
Showing top 10 nodes out of 81
flat flat% sum% cum cum%
1030.83kB 35.91% 35.91% 1037.16kB 36.13% github.com/restic/chunker.NewWithBoundaries
390.02kB 13.59% 49.50% 390.02kB 13.59% reflect.New
278.60kB 9.71% 59.20% 278.60kB 9.71% github.com/restic/restic/internal/restic.NodeFromFileInfo
274.66kB 9.57% 68.77% 933.59kB 32.52% github.com/restic/restic/internal/archiver.(*Archiver).SaveDir
244.05kB 8.50% 77.27% 244.05kB 8.50% bytes.makeSlice
169.81kB 5.92% 83.19% 169.81kB 5.92% github.com/restic/restic/internal/archiver.(*TreeSaver).Save
80kB 2.79% 85.98% 80kB 2.79% github.com/restic/restic/internal/restic.BlobSet.Insert
44.17kB 1.54% 87.52% 292.20kB 10.18% github.com/restic/restic/internal/archiver.(*TreeSaver).save
40.16kB 1.40% 88.91% 40.16kB 1.40% strings.(*Builder).grow
40kB 1.39% 90.31% 40kB 1.39% github.com/restic/restic/internal/restic.NewBlobBuffer
It also creates a 72 MB my.db file where the index is saved.
As expected, the memory consumption is basically gone!
This however comes with a big tradeoff in backup time (37s as compared to 13.1s with standard index).
I think it would be interesting to see how this performs with bigger data sets. No clue if this performance issue just adds a factor of about 3 or grows even more than linearly.
Moreover, it seems to me that much more testing and fine-tuning would be needed for storage-based index strategies...
As in the bolt setting the db is not deleted after the run, I reran with an already present db:
processed 100000 files, 567.676 KiB in 0:08
snapshot b364fba4 saved
14.26user 1.94system 0:12.07elapsed 134%CPU (0avgtext+0avgdata 133464maxresident)k
0inputs+88outputs (0major+2601minor)pagefaults 0swaps
Memory profile is almost identical to the first run which is as expected.
I really wonder why this second run is so much faster than the first. Is it the fs cache that is working now? Or does boltdb somehow use the info from the already stored index to speed things up?
Generally this may indicate that with a storage-based index we might have to think about a permanently stored index, because it can boost performance quite a lot!
Maybe worth trying an LSM-based key/value store too. LSMs have quite different performance and disk-footprint tradeoffs compared to B-trees, and may work better for the persistent index use case. I'd personally use RocksDB but there are probably good-enough Go-native stores too.
Thank you for prototyping this! Looks very encouraging.
The index files are currently encrypted, so whatever disk store we use should be encrypted too. I looked at https://github.com/dgraph-io/badger - it's Go native, LSM-based and supports encryption.
Low memory index looks good for backups. I'm not sure if it will scale for restores from repos with huge number of blobs though.
Thanks for the discussion so far.
I have to admit that I didn't spend much time in trying other storage-based index options. (badger is still on my to-do-list)
The reason is the following: quite a few issues will arise when seriously trying to bring this to a production-grade state. I see issues about encryption, how to clean up files, how to deal with aborted operations, and so on.
This is why I started thinking about another possibility for optimizing the in-memory index. The result is already in my branch index-alternative and is named low-mem2. The idea is to store all index data in sorted tables (saving the overhead of the map) and to try not to store anything twice. The pack ID, for instance, is saved in a separate table and only referenced from the main table.
I have not run many tests so far, but it seems that performance is just a little worse than with the standard index. On the other hand, memory consumption is comparable to low-mem, i.e. saving around 60% compared to the standard index in my test setting, see above.
Considering that for most operations (e.g. backup) only specific information of the index is needed, this can be further improved.
So I would suggest the following: I will try to make a PR to replace (or optionally replace) the standard index with the memory-optimized one (low-mem2). This will hopefully give a quick improvement for the memory issue.
The possibility of using a disk-based index then remains open for the future.
What is your opinion about it? Maybe some of the core maintainers can give me a hint about the right direction to work on. Thank you!
Cutting the memory usage in half sounds like a reasonable short-term solution to me. I'm currently a bit worried about a possible performance hit from keeping the data in a sorted array. The current implementation of the MasterIndex simply iterates over a list containing the indexes. For larger repositories with a few thousand index files, restic spends a large fraction of the backup/check time iterating over the different indexes, see #2284.
Regarding your low-mem2 branch: I'm a bit confused by the SortByBlob and SortByPack functions. Shouldn't it be possible to just sort by blobID and then just iterate over the whole index for lookups based on the packID (as is done by the current implementation)? (My preference would be to get completely rid of the locking once an index has been finished)
The code is currently only able to convert an existing Index to the new format. It would be rather nice if it could also convert a finalized Index to the more efficient format, to get the improved memory consumption during the initial backup as well.
@MichaelEischer: Thanks for pointing out this issue in masterindex. As far as I can see, there are three possible ways the index is used:
As 1. is the primary use case and 2. is quite similar, I agree that we should optimize for these. I would propose to also change the way the master index works and to allow an index structure to be filled with the index data of many index files that have been read (and also with index files that were created and are then only used read-only).
About low-mem2:
So how do we continue? As already mentioned I can prepare a PR to exchange the current index/masterindex implementation by an optimized one.
@MichaelEischer I would appreciate early feedback and improvement suggestions - may I give you some early parts of the work for review? Maybe we'll also find some people to test the changes with respect to performance and memory usage in real-life scenarios...
@aawsome Sorry for the late reply. I could review early parts of the work; however, it would also be best to get some comments on the overall design from the core maintainers. For testing I have an 8TB repo with 40 million files at hand; for my index performance tests I've so far used the check command, which stresses the index a lot.
The MasterIndex is used for case 1 and 2. Merging all index files in-memory for a static index which would suffice for case 1 should be rather easy to implement. A single sorted list will definitely be faster than the current index once it reaches a certain number of index files (O(log n) vs. O(n)). Dynamically adding full/finalized index files (case 2) does not easily combine with a single sorted blob list, as it would need to be resorted after each added index. Fully resorting an index with hundreds of millions of entries several times sounds like a rather bad idea to me. [EDIT] Just merging two already sorted lists also works and basically just amounts to a full copy of the existing list. With a hundred million entries this most likely still takes a lot of time. [/EDIT] Dynamically growing such a list would also require some way to make room for new entries without having to keep a full copy of the old and new list buffers in memory.
The alternatives to a sorted list that come to my mind right now would be some sort of tree or a hashmap. The former would probably require some additional pointers in the data structure and the latter would be rather similar to the current implementation. I've thought about replacing the MasterIndex with a single large hashmap, but right now the growth strategy for the backing buffer of the hashmap kept me from trying: Go doubles the backing buffer size once a certain load factor is reached, which in the end might require temporarily keeping a rather empty new backing buffer along with the old one in memory. This scenario would probably require much more memory than the current implementation. This should be solvable by using two levels of hashmaps where the first one is indexed using the first few bits of the blob ID. I'm just not sure how to efficiently handle blob IDs which are not uniformly distributed and thus could still lead to large imbalances in the second-level hashmaps. It might be worthwhile to look at the index data structures used by databases or borg-backup.
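For illustration, a minimal sketch (types and names are assumptions) of the two-level hashmap idea: partitioning by the first byte of the blob ID means each second-level map grows and reallocates independently, which bounds the temporary spike when one map has to double its backing buffer:

```go
package index

// Sketch of the two-level hashmap idea discussed above; all types and names
// are assumptions. Blob IDs are SHA-256 hashes, so using the first byte as
// the shard key should spread entries reasonably evenly.
type blobID [32]byte

type packEntry struct {
	packIdx uint32 // reference into a separate table of pack IDs
	offset  uint32
	length  uint32
}

type shardedIndex struct {
	shards [256]map[blobID]packEntry
}

func newShardedIndex() *shardedIndex {
	s := &shardedIndex{}
	for i := range s.shards {
		s.shards[i] = make(map[blobID]packEntry)
	}
	return s
}

func (s *shardedIndex) store(id blobID, e packEntry) {
	// Only the affected shard ever has to grow its backing buffer.
	s.shards[id[0]][id] = e
}

func (s *shardedIndex) lookup(id blobID) (packEntry, bool) {
	e, ok := s.shards[id[0]][id]
	return e, ok
}
```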
Case 3 (prune/rebuild-index/find) is handled by index/Index.New right now, so this shouldn't collide with changes to the MasterIndex.
The ListPack methods on Index/MasterIndex look like dead code to me; restic still compiles when I remove them (only the Index test needs some tweaks). The only usages of a ListPack method I could find are on the Repository struct. This would make SortByPack unnecessary and avoid the complexity associated with resorting.
Searching for packs in a sorted-by-blob list would work, but would probably be a bit slower than the current implementation. The current implementation iterates over all indices and stops once an index containing the pack was found, which on average requires traversing half of the list. Scanning the sorted-by-blob list would always require a full scan. On the other hand, the scanning itself should be faster, so the overall performance should be more or less comparable.
@aawsome So right now the focus of the changes would be to optimize the memory usage of the index along with the masterIndex performance, while deferring fundamental changes like using an on-disk database to later?
Regarding case 3: Do you plan to also replace index/Index, which is grouped by packID? My current understanding of the restic code is that this would be the only potential user of ListPack.
Right now I can only find two active uses of index/Index.Packs:
- cmd_list/runList, which just needs some kind of blob list and doesn't really care about sorting
- cmd_prune/pruneRepository, which uses it for statistics, to iterate over all blobs to find duplicates and to make the decision which pack to rewrite or remove.

Only the last usage is a bit more complex to avoid as it would require some additional hashmaps to keep track of used blobs per pack. Your "Optimize Index" merge request would add another usage which actually requires an index grouped by pack files.
Regarding an optimized MasterIndex: My last comment mostly looked at using a single large index data structure that would probably need to use partitioning by key to avoid large spikes in memory usage when adding new entries. That way of partitioning has the downside that adding a new index likely requires touching every partition. In addition, imbalances in the partition sizes could cause further problems.
However, there is also another option (which is slightly inspired by log structured merge trees): Keep the number of entries in the MasterIndex roughly constant at a value c. This would mean that a new index would, after finalizing, be merged into one of the existing index parts. That way only a part of the index is modified at a time. For hash maps this could be used to keep the load factor rather high, as each part would be filled up before using the next one. For sorted arrays some other strategy sounds more reasonable: merge index parts into larger ones with roughly 1k, 2k, 4k, 8k, ... entries, while only merging the currently largest size class once a certain constant number of index parts of that size exists. This should provide O(log n) complexity for incrementally adding new index files. Depending on the constant c, either adding new index files is rather cheap (for small index parts) or lookup is fast (low number of index parts).
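A rough sketch of that size-class merging idea for sorted index parts (all names are made up for illustration): new parts start small and are merged into larger parts once parts of similar size accumulate, so only a bounded fraction of the index is rewritten per merge:

```go
package index

import "sort"

// entry is a minimal stand-in for an index entry; only the blob ID matters
// for the merge logic sketched here.
type entry struct {
	id   [32]byte
	pack uint32
}

// lsmIndex keeps the index as several sorted parts and merges neighbouring
// parts once they reach a similar size, loosely following the size-class
// scheme described above (fanout should be >= 2).
type lsmIndex struct {
	parts  [][]entry
	fanout int
}

func (ix *lsmIndex) add(newPart []entry) {
	sort.Slice(newPart, func(i, j int) bool { return lessID(newPart[i].id, newPart[j].id) })
	ix.parts = append(ix.parts, newPart)
	// Simplified merge trigger: merge the two newest parts while the newer
	// one is not much smaller than its predecessor.
	for len(ix.parts) >= 2 {
		n := len(ix.parts)
		a, b := ix.parts[n-2], ix.parts[n-1]
		if len(b)*ix.fanout < len(a) {
			break
		}
		ix.parts = append(ix.parts[:n-2], mergeSorted(a, b))
	}
}

// mergeSorted merges two sorted slices into a new sorted slice.
func mergeSorted(a, b []entry) []entry {
	out := make([]entry, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if lessID(a[i].id, b[j].id) {
			out = append(out, a[i])
			i++
		} else {
			out = append(out, b[j])
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func lessID(x, y [32]byte) bool {
	for k := range x {
		if x[k] != y[k] {
			return x[k] < y[k]
		}
	}
	return false
}
```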
@aawsome So right now the focus of the changes would be to optimize the memory usage of the index along with the masterIndex performance, while deferring fundamental changes like using an on-disk database to later?
Yes, that is my focus. We can discuss adding options for the backup command to only keep tree blobs in the index (and thus stupidly saving each new blob in the snapshot), or to maybe only load the data blobs used in the previous snapshot into the index (and thus only having deduplication against blobs in the previous snapshots). This would allow low-memory clients to back up to large repositories, followed by prune operations on big-memory machines to do the full deduplication.
I will postpone my work on the disk-based index.
Regarding case 3: Do plan to also replace index/Index which is grouped by packID?
I do think that this is possible and should be the goal of the new implementation. I however didn't analyze it in detail - thanks for your hints!
About implementation details: I'll need a bit of time to read and think about it. Thank you very much for your ideas and support!
As announced, I prepared an early version of the refactored index, see refactor-index.
Some internals are also slightly changed to make use of the new index. For instance, check should now require much less memory than before...
It's still WIP, but almost all tests already pass and from my tests it looks pretty good.
Here are the open points:
- rebuild-index and prune to use the new MasterIndex and the new in-memory index

The implementation uses one big (sorted) array to store the index and the IndexJSON format to keep items created in the current run.
When backing up, the IndexJSON is regularly stored to the repository and added to the main in-memory index.
When loading the index from the repository, each index file is loaded into the IndexJSON format and then added to the main in-memory index.
Hence, the crucial function is AddJSONIndex in internal/index/index_new.go, starting from line 278.
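For context, the index files being loaded have (roughly) the JSON shape sketched below as Go structs with JSON tags; this follows restic's repository index format as I understand it, so treat the details as an approximation rather than the canonical definition:

```go
package index

// Approximate Go representation of an on-disk index file ("IndexJSON");
// field names follow restic's documented repository format as I understand
// it, so treat the details as a sketch rather than the canonical definition.
type jsonIndex struct {
	Supersedes []string   `json:"supersedes,omitempty"`
	Packs      []jsonPack `json:"packs"`
}

type jsonPack struct {
	ID    string     `json:"id"`
	Blobs []jsonBlob `json:"blobs"`
}

type jsonBlob struct {
	ID     string `json:"id"`
	Type   string `json:"type"` // "data" or "tree"
	Offset uint   `json:"offset"`
	Length uint   `json:"length"`
}
```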
About your comments, @MichaelEischer: I cannot really estimate the impact of appending to the indextable, see line 329 in internal/index/index_new.go. Does this mean that the whole table has to be copied regularly (you mentioned the "memory spikes")? Or is this handled nicely by the memory management of Go and of the operating system?
Also, after inserting, the table needs to be sorted again (at least before searching it for the first time). I guess that the current implementation should avoid unnecessary sort operations, but this needs performance tests. Can anybody help with these tests?
@aawsome:
Can anybody help with these tests?
I'm willing to help with performance testing. What exactly do you want to have tested?
@dimejo:
Thank you very much for your help!
I would prefer 1:1 comparisons (i.e. the identical run with both branches) between the master branch (or last release) and my branch.
I'm interested in differences about memory usage and performance (total time and cpu usage) for large repositories (many files, many snapshots, etc). Using /usr/bin/time -v is ok, maybe enabling profiling (memory profiling and cpu profiling) can help to better explain differences.
The backend doesn't matter at all - a local backend is fine but not a must.
I think at least the following commands should be tested:
- backup: 1. initial run, 2. another run with (almost) unchanged files, 3. another run with --force
- check
- restore
- mount
- list blobs
- cat blob
- find --blob

[edit: added list, cat and find commands]
Please write if you don't understand what I mean or if you need more detailed instructions on how to run these tests!
I'll try to do all the tests this weekend but I need some help enabling profiling. I'm not a programmer and following advice on the internet didn't work out that well :/
You can start with
/usr/bin/time -v restic <command>
To enable profiling, first compile restic with the debug option:
go run build.go --tags debug
and then use the flag --mem-profile or --cpu-profile, e.g.:
restic --mem-profile . <command>
This creates a .pprof file that can be analyzed with go tool pprof, e.g.:
go tool pprof -pdf restic mem.pprof
returns a nice pdf.
Awesome! I thought it would be much more complex to enable profiling. Will try it out and report back.
@aawsome Thanks for your work, that's really a lot of changes.
I did some test runs with the check command and something strange happens. Using the restic master the command completes in 6 minutes (327.25 real 613.08 user 208.48 sys), with the index changes it takes more than an hour (4516.77 real 896.17 user 531.71 sys). The time is spent somewhere after printing "check snapshots, trees and blobs", but didn't have time to take a closer look yet.
In my opinion your commit with the index changes is far too large to be reviewable, thousand lines of added code along with two thousand deletions is really a lot.
I've noticed at least four individual changes:
I have some (incomplete) comments on the code itself but I didn't think too much yet about the overall code architecture:
struct IndexLowMem2.byFile: It took me quite some time to understand the variable name. Please change it to byPackFile or similar to be more descriptive.
IndexLowMem2/FindPack: My understanding of restic.IDs.Find is that it expects a sorted list of IDs. However, idx.packTable does not seem to be sorted as this would break the packIndex in blobWithoutType.
Race condition in index_new/FindBlob: Index might be resorted between the call to SortByBlob and reacquiring the read lock. Currently I don't see a nice way how this could be fixed when using a RWMutex. The same problem also exists in FindBlobByPack.
IndexLowMem2/Lookup: The call to FindBlob and accessing the blobTable is not atomic.
IndexLowMem2/AddJSONIndex: Calls to append for a slice allocate a new buffer when the old one is too small and then copy all data over to the new buffer. As the method first collects the new packTable/indexTable entries and then appends them in one go, this shouldn't be too inefficient.
The packTable slices in idx.byFile therefore will point to old versions of the packTable array when append allocates a new buffer. This will probably waste quite a lot of memory for large packCounts. You could just add an endPackIndex to FileInfo in addition to startPackIndex and then directly access the packTable.
Appending to it.blobTable looks a bit expensive as this requires sorting later on. It should be possible to presort the new table entries and then just merge the two sorted lists (however, merging inplace might be a bit tricky).
MasterIndex.Lookup/Has/Count/... : After the lookup in mi.mainIndex, mi.idx might be saved and added to the mainIndex before the idxMutex is acquired.
MasterIndex.Save: This seems to block the masterIndex while the index upload is pending. The old implementation in Repository.SaveIndex seems to work without blocking.
Ah, I've found the problem: Right now the call to PrepareCache is commented out. PrepareCache also configures the PerformReadahead function of the Cache. Without this Check has to load every single tree data blob separately from the backend.
@MichaelEischer: Thank you very much for giving your very valuable comments to this early stage of implementation :+1:
I do know that it is quite a lot of code change - however I think it is also a simplification of the current index implementation. Hence the large number of deleted files/code lines.
Sorry about the PrepareCache issue - I recall I commented it out to handle it later but so far I only did local backend tests and therefore the issue didn't come up..
About your correct comments on the byPack stuff: You are right, this is WIP and not used yet. I left it in because I intended to use it later for rebuild-index and prune. However, I will first focus on getting the main use case to work - maybe I'll remove this part before making a PR.
About the sorting: Right now the code is supposed to do the sorting only when needed, i.e. within the first call to Lookup or Has. This means that when loading many index files into the in-memory index (which is the usual operation at the beginning) no sorting takes place; it only happens once all index files are already loaded. During backup - when more index files are written - entries are added to the in-memory index, and each time a resort is needed. This, however, is a resort of a mostly sorted list. I don't know how sort.Sort handles almost-sorted lists, but usually sort algorithms that are very efficient for almost-sorted input are used.
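To illustrate the lazy sorting described here, a minimal single-goroutine sketch (field names are assumptions; locking is left out and discussed further below):

```go
package index

import "sort"

// lazyIndex sorts only when needed: appends merely mark the table as
// unsorted, and the sort is deferred until the first lookup after new
// index entries were added. Names are illustrative assumptions.
type lazyIndex struct {
	ids          []string // blob IDs, hex-encoded to keep the sketch short
	sortedByBlob bool
}

func (ix *lazyIndex) add(ids []string) {
	ix.ids = append(ix.ids, ids...)
	ix.sortedByBlob = false
}

func (ix *lazyIndex) has(id string) bool {
	if !ix.sortedByBlob {
		sort.Strings(ix.ids)
		ix.sortedByBlob = true
	}
	i := sort.SearchStrings(ix.ids, id)
	return i < len(ix.ids) && ix.ids[i] == id
}
```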
I'll take care about the other issues - thanks especially for finding the race conditions!
In my opinion your commit with the index changes is far too large to be reviewable, thousand lines of added code along with two thousand deletions is really a lot.
This.
I do know that it is quite a lot of code change - however I think it is also a simplification of the current index implementation. Hence the large number of deleted files/code lines.
This is when you should split your changes into separate individual PRs, so it's manageable and with clear intent.
@rawtaz:
The clear intent is a redesign of the index used in restic to get rid of the problems with the current implementation.
IMO the change is not that big considering that it is a complete re-implementation of the index (and please note: it also re-implements some legacy stuff, see index/index_old_format.go).
However this is WIP and has some issues I'm willing to fix before thinking about how to best integrate this work into the master branch.
I'm looking forward to getting ideas how to split it into smaller parts - can you please give me directions what changes could be accepted as PRs?
I did some test runs with the check command and something strange happens. Using the restic master the command completes in 6 minutes (327.25 real 613.08 user 208.48 sys), with the index changes it takes more than an hour (4516.77 real 896.17 user 531.71 sys). The time is spent somewhere after printing "check snapshots, trees and blobs", but didn't have time to take a closer look yet.
This should be fixed now.
struct IndexLowMem2.byFile: It took me quite some time to understand the variable name. Please change it to byPackFile or similar to be more descriptive.
Is changed now.
IndexLowMem2/FindPack: My understanding of restic.IDs.Find is that it expects a sorted list of IDs. However, idx.packTable does not seem to be sorted as this would break the packIndex in blobWithoutType.
The whole byPack logic was not used and you correctly pointed out that there were also some errors within. I cleaned this dead code.
Race condition in index_new/FindBlob: Index might be resorted between the call to SortByBlob and reacquiring the read lock. Currently I don't see a nice way how this could be fixed when using a RWMutex. The same problem also exists in FindBlobByPack.
I hope I found a nice way to handle this: I basically test sortedByBlob twice if it is set to false. This issue should hence be fixed now.
IndexLowMem2/Lookup: The call to FindBlob and accessing the blobTable is not atomic.
This is now done using the new atomic indexTable.Get.
The packTable slices in idx.byFile therefore will point to old versions of the packTable array when append allocates a new buffer. This will probably waste quite a lot of memory for large packCounts. You could just add an endPackIndex to FileInfo in addition to startPackIndex and then directly access the packTable.
Is fixed now. packTable was so far not used...
MasterIndex.Lookup/Has/Count/...: After the lookup in mi.mainIndex, mi.idx might be saved and added to the mainIndex before the idxMutex is acquired.
This is fixed now.
MasterIndex.Save: This seems to block the masterIndex while the index upload is pending. The old implementation in Repository.SaveIndex seems to work without blocking.
This issue is still open. I did not find a nice way to handle it with just one unfinished index.
I also do not know the impact of this with respect to performance. It should only affect backup. I hope to see some tests showing whether this is relevant.
I did some test runs with the check command and something strange happens. Using the restic master the command completes in 6 minutes (327.25 real 613.08 user 208.48 sys), with the index changes it takes more than an hour (4516.77 real 896.17 user 531.71 sys). The time is spent somewhere after printing "check snapshots, trees and blobs", but didn't have time to take a closer look yet.

This should be fixed now.
I'll give it a try in the next days.
Race condition in index_new/FindBlob: Index might be resorted between the call to SortByBlob and reacquiring the read lock. Currently I don't see a nice way how this could be fixed when using a RWMutex. The same problem also exists in FindBlobByPack.

I hope I found a nice way to handle this. I basically test sortedByBlob twice if it is set to false. This issue hence should be fixed now.
This still leaves the possibility of a collision between SortByBlob and AddJSONIndex. The check that the index is sorted properly and the actual index search must happen in the same critical section. Unluckily the RWMutex does not support an atomic downgrade from a write to a read lock, which would be the easiest way to keep the lock after sorting. I currently see two possibilities:

* Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and complete just that search while holding the writelock.
* Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and retry from the start.
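A sketch of the second option (assumed names): the check that the table is sorted and the binary search stay inside one read-locked section, and an unsorted table is sorted under the write lock before retrying:

```go
package index

import (
	"sort"
	"sync"
)

// blobIndex sketches the "sort under the write lock, then retry" pattern
// described above (option 2); field and method names are assumptions.
type blobIndex struct {
	mu           sync.RWMutex
	ids          []string
	sortedByBlob bool
}

func (ix *blobIndex) Has(id string) bool {
	for {
		ix.mu.RLock()
		if ix.sortedByBlob {
			// Sortedness check and search happen in the same critical section.
			i := sort.SearchStrings(ix.ids, id)
			found := i < len(ix.ids) && ix.ids[i] == id
			ix.mu.RUnlock()
			return found
		}
		ix.mu.RUnlock()

		// Not sorted: sort under the write lock, then retry from the start,
		// since new entries may have been added in between.
		ix.mu.Lock()
		if !ix.sortedByBlob {
			sort.Strings(ix.ids)
			ix.sortedByBlob = true
		}
		ix.mu.Unlock()
	}
}
```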
MasterIndex.Lookup/Has/Count/...: After the lookup in mi.mainIndex, mi.idx might be saved and added to the mainIndex before the idxMutex is acquired.
It seems like you've missed MasterIndex.Lookup.
MasterIndex.Save: This seems to block the masterIndex while the index upload is pending. The old implementation in Repository.SaveIndex seems to work without blocking.

This issue is still open. I did not find a nice way to handle it with just one unfinished index. I also do not know the impact of this with respect to performance. Should only affect backup. I hope to see some tests if this is relevant.
Hmm, the main blocker here is that you can't get the index pack id before the upload call completed. It would be rather easy to solve by allowing multiple indexes to be unfinished while only the last one may be written to and the predecessors are currently uploading.
This still leaves the possibility of a collision between SortByBlob and AddJSONIndex. The check that the index is sorted properly and the actual index search must happen in the same critical section. Unluckily the RWMutex does not support an atomic downgrade from a write to a read lock, which would be the easiest way to keep the lock after sorting. I currently see two possibilities:

* Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and complete just that search while holding the writelock.
* Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and retry from the start.
Oops - I was in fact so focused on SortByBlobs that I didn't consider the methods it is called from :-(
Thanks for your suggestions! I used the second option now.
It seems like you've missed MasterIndex.Lookup.
Is fixed now.
Hmm, the main blocker here is that you can't get the index pack id before the upload call completed. It would be rather easy to solve by allowing multiple indexes to be unfinished while only the last one may be written to and the predecessors are currently uploading.
I also fixed this. I now encode/encrypt the index and write it to the main index while locked, but write the file to the backend after releasing the lock.
Moreover, I found another optimization that reduces the memory consumption quite a lot (and also fixes the append problem): I replaced the large array with a structure that implements a "paged" list of blobs. A list of "pages" is maintained, and if it needs to be enlarged, just a new page is allocated. This allows appending without recopying all old elements. Also, the overhead is fixed and at most the page size, whereas with append you get quite a large overhead (the array gets an increasing "capacity reserve" to ensure it doesn't need to recopy too often). This optimization seems to save another ~15-20% of memory.
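A minimal sketch of such a paged list (page size, names and element type are illustrative assumptions): appending allocates at most one new page and never copies existing elements, and the unused capacity is bounded by a single page:

```go
package index

// pagedList stores elements in fixed-size pages so that growing it never
// copies existing data; only the small slice of page headers is extended.
// The page size and the element type are illustrative assumptions.
const pageSize = 1 << 16

type pagedList struct {
	pages [][]uint64
	count int
}

func (p *pagedList) add(v uint64) {
	if p.count%pageSize == 0 {
		// Start a new page; existing pages are left untouched.
		p.pages = append(p.pages, make([]uint64, 0, pageSize))
	}
	last := len(p.pages) - 1
	p.pages[last] = append(p.pages[last], v)
	p.count++
}

func (p *pagedList) get(i int) uint64 {
	return p.pages[i/pageSize][i%pageSize]
}

func (p *pagedList) len() int {
	return p.count
}
```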
I'm happy to get feedback and would even more like to see performance tests with large repositories.
Here are the results of my first comparison.
I used a data dir with 550,000 files (resulting in 550,000 data blobs), created with the following script:
#!/bin/sh
mkdir data
for i in `seq 1 550`; do
    echo $i
    mkdir data/$i
    for j in `seq 1 1000`; do
        echo $i$j > data/$i/$j
    done
done
Then I backed up this data dir three times and ran check afterwards:
/usr/bin/time restic -r /path/to/repo --mem-profile . backup /path/to/data
/usr/bin/time restic -r /path/to/repo --mem-profile . backup /path/to/data
/usr/bin/time restic -r /path/to/repo --mem-profile . backup /path/to/data -f
/usr/bin/time restic -r /path/to/repo --mem-profile . check
I ran these steps once with the master branch (compiled to restic.old) and once with my branch (compiled to restic.new).
I just used my old laptop with local SSD.
Here are the results:
time_backup1_old.txt
mem_backup1_old.pdf
time_backup1_new.txt
mem_backup1_new.pdf
time_backup2_old.txt
mem_backup2_old.pdf
time_backup2_new.txt
mem_backup2_new.pdf
time_backup3_old.txt
mem_backup2_old.pdf
time_backup3_new.txt
mem_backup3_new.pdf
time_check_old.txt
mem_check_old.pdf
time_check_new.txt
mem_check_new.pdf
To summarize:
- … master (!!!)
- check index related memory usage is decreased even more (the IDSet was removed)
- the master index implementation (looping over all index files) didn't play a big role here. I believe that things completely change with many index files and that the new implementation will then clearly outperform the current one.

I'm finally done with my tests. It took longer than estimated because I had to re-run my tests because of some mysterious results. I wanted to compare results between 1 and multiple cores. Interestingly the VM with only 1 vCPU was a lot faster with almost all operations. Maybe @fd0 knows what's going on.
1 vCPU
2GB RAM
Note that I've made 1 CPU and 1 memory profiling run for each restic version. Time shown (in seconds) in this table is the average for both runs.
| | restic v.0.9.6 | restic new-index | difference |
| :--- | ---: | ---: | ---: |
| init repo | 2,24 | 2,27 | 1,12% |
| 1st backup | 1574,36 | 1542,80 | -2,00% |
| 2nd backup | 556,95 | 541,71 | -2,74% |
| 3rd backup (--force) | 1192,90 | 1195,53 | 0,22% |
| forget | 0,51 | 0,52 | 2,97% |
| prune | 43,87 | 44,02 | 0,34% |
| check | 30,77 | 31,59 | 2,68% |
| list blobs | 2,92 | 3,36 | 14,90% |
| cat blob | 2,58 | 2,60 | 0,97% |
| find blob | 22,86 | 21,17 | -7,39% |
| restore | 895,01 | 883,57 | -1,28% |
8 vCPU
32GB RAM
Note that I've made 1 CPU and 1 memory profiling run for each restic version. Time shown (in seconds) in this table is the average for both runs.
| | restic v.0.9.6 | restic new-index | difference |
| :--- | ---: | ---: | ---: |
| init repo | 2,11 | 2,09 | -0,95% |
| 1st backup | 1894,47 | 1832,65 | -3,26% |
| 2nd backup | 827,95 | 776,38 | -6,23% |
| 3rd backup (--force) | 1414,60 | 1411,98 | -0,19% |
| forget | 0,60 | 0,56 | -6,72% |
| prune | 93,10 | 89,84 | -3,50% |
| check | 70,22 | 68,44 | -2,53% |
| list blobs | 3,86 | 7,24 | 87,68% |
| cat blob | 3,88 | 3,69 | -4,90% |
| find blob | 30,95 | 29,11 | -5,95% |
| restore | 1150,49 | 1089,66 | -5,29% |
$ uname -r
5.4.15-200.fc31.x86_64
$ restic-orig version
debug enabled
restic 0.9.6 (v0.9.6-40-gd70a4a93) compiled with go1.13.6 on linux/amd64
$ restic-newindex version
debug enabled
restic 0.9.6 (v0.9.6-43-g5db7c80f) compiled with go1.13.6 on linux/amd64
Files changed/added for backup 1:
Files: 487211 new, 0 changed, 0 unmodified
Dirs: 2 new, 0 changed, 0 unmodified
Added to the repo: 62.069 GiB
Files changed/added for backup 2:
Files: 166940 new, 0 changed, 487211 unmodified
Dirs: 0 new, 2 changed, 0 unmodified
Added to the repo: 6.805 GiB
Files changed/added for backup 3:
Files: 654215 new, 0 changed, 0 unmodified
Dirs: 2 new, 0 changed, 0 unmodified
Added to the repo: 4.029 MiB
logs_vm1_cpu.zip
logs_vm1_mem.zip
logs_vm8_cpu.zip
logs_vm8_mem.zip
@dimejo Thank you very much for your tests!
First, regarding your question about increased time for the 8 CPU setting: you are comparing the "user time", which is the total CPU seconds used. With parallel processing this is (and should be) higher than with a single CPU, as there is always some overhead due to parallelization. In fact your 8 CPU runs were much faster, see the elapsed time.
You used the commit 5db7c80f for your newindex tests. Is it possible that you retest with the latest commit 26ec33dd? There should be another decrease in memory usage and (if I did everything right) maybe even better performance.
In general I'm quite satisfied with these results! It seems my implementation uses much less memory while still having performance similar to the original index implementation. (BTW: I'll fix the obviously open issue with list blobs.)
First, regarding your question about increased time for the 8 CPU setting: you are comparing the "user time", which is the total CPU seconds used. With parallel processing this is (and should be) higher than with a single CPU, as there is always some overhead due to parallelization. In fact your 8 CPU runs were much faster, see the elapsed time.
Thanks for the explanation. Stupid me didn't even see the elapsed time ...
You used the commit 5db7c80f for your newindex tests. Is it possible that you retest with the latest commit 26ec33dd? There should be another decrease in memory usage and (if I did everything right) maybe even better performance.
Sure.
Note that I've listed elapsed time (in mm:ss) now. Again showing average time for both runs.
| | v0.9.6 | newindex | newindex2 | difference (v0.9.6 and newindex2) |
| :--- | ---: | ---: | ---: | ---: |
| init repo | 00:42,4 | 01:11,6 | 00:21,5 | -49,27% |
| 1st backup | 26:42,7 | 26:58,2 | 31:00,4 | +16,08% |
| 2nd backup | 10:18,9 | 09:54,1 | 09:03,7 | -12,14% |
| 3rd backup (--force) | 17:32,6 | 17:05,8 | 20:37,1 | +17,52% |
| forget | 00:00,8 | 00:00,9 | 00:00,8 | +0,61% |
| prune | 01:02,5 | 00:56,5 | 00:51,2 | -18,16% |
| check | 00:31,6 | 00:30,8 | 00:31,0 | -1,74% |
| list blobs | 00:05,1 | 00:06,2 | 00:05,6 | +9,44% |
| cat blob | 00:02,2 | 00:02,3 | 00:02,1 | -3,39% |
| find blob | 00:28,5 | 00:28,2 | 00:23,0 | -19,28% |
| restore | 13:02,0 | 12:23,3 | 14:09,1 | +8,58% |
This table shows total memory usage (in MB) from the pprof files. Hope that's the correct one.
| | v0.9.6 | newindex | newindex2 | difference (v0.9.6 and newindex2) |
| :--- | ---: | ---: | ---: | ---: |
| init repo | 32,01 | 32,01 | 32,00 | -0,03% |
| 1st backup | 227,50 | 178,48 | 174,77 | -23,18% |
| 2nd backup | 227,89 | 157,92 | 149,52 | -34,39% |
| 3rd backup (--force) | 204,53 | 147,15 | 136,06 | -33,48% |
| forget | 32,25 | 32,09 | 32,26 | +0,03% |
| prune | 167,55 | 108,96 | 130,60 | -22,05% |
| check | 163,29 | 75,06 | 70,34 | -56,92% |
| list blobs | 33,22 | 30,21 | 25,47 | -23,33% |
| cat blob | 113,79 | 48,86 | 32,23 | -71,68% |
| find blob | 79,48 | 30,40 | 25,64 | -67,74% |
| restore | 226,08 | 176,54 | 198,19 | -12,33% |
8 vCPU
32GB RAM
$ restic-newindex2 version
debug enabled
restic 0.9.6 (v0.9.6-44-g26ec33dd) compiled with go1.13.6 on linux/amd64
@aawsome : I still have problems with the check command not caching tree blobs. The old code called Repository.SetIndex which also triggered the calls to PrepareCache. Now the call to SetIndex is removed, however, I couldn't find a replacement for the call to PrepareCache.
@dimejo Thank you very much for the tests! This gives very good information, and the results regarding memory consumption already look really good!
I tried to understand the memory consumption results for v0.9.6, 3rd backup, but I was not able to reproduce your results. May it be that you used the memory consumption of the index (102 MB) instead of the total memory (204 MB)?
I also do not understand why newindex2 shows higher memory consumption than newindex for prune and restore. In the memory profile you can see that the memory used by the index is lower and the higher memory consumption is due to encoding/json.Marshal and restic.NewBlobBuffer in prune and restorer.(*packCache).get in restore (which are basically not present in the profile for v0.9.6). Also the memory consumption shown by time is lower for newindex2 than it is for newindex.
So I believe that newindex2 also has lower memory consumption for these two commands than newindex, but I don't understand why the profiling shows these extra (non-index related) parts consuming memory..
About the total time: I'm a bit puzzled that newindex and newindex2 differ so much and that the differences are sometimes positive and sometimes negative. I tried to look at the CPU profiles of some runs and did not see any index-related CPU usage. For instance, for the three backup runs the CPU-consuming parts in all runs (v0.9.6 and newindex2) are totally dominated by sha256, chunker, syscall and runtime, and I do not see why there should be differences due to a different index implementation.
Can it be that the machine doesn't have constant cpu power and the differences are not caused by different implementations but by different machine circumstances? When you compared v0.9.6 and newindex, the results did not vary much and as far as it looks these two runs were done at approximately the same time.
In other words: is it possible for you to rerun the CPU measurement for v0.9.6 and newindex2 at the same time? Then we could check whether variation in CPU timings is common and maybe get better insights into performance changes due to the changed implementation.
@MichaelEischer Thank you for reporting this still unfixed issue and sorry that it doesn't work so far.
The call to PrepareCache is in internal/index/masterindex.go:Load which is called by internal/repository/repository.go:LoadIndex.
I will have to debug that part of the code.
Just leaving another set of ad-hoc results for the branch with a big-ish repo (~500GB / 1500+ snapshots) + small-ish backup (1.3GB). Very promising :+1:
default-restic.txt
patched.txt
@MichaelEischer I found the problem with the check command not using the cache: as check rebuilds the functionality of masterindex.Load(), PrepareCache has to be called explicitly (it was in repository.UseIndex()). This is fixed now.
I also fixed the performance issue with list blobs.
@seqizz Thanks for testing! I'm quite satisfied with the memory consumption :smile:
Do you have an explanation why the elapsed time in the patched run is much higher than in the default-restic run? I can't explain this behavior by the index changes. Did you see reduced parallelization due to the different implementation or is this an effect that can be explained from within the system you used?
BTW: total CPU times seem to be comparable between the two runs. This is exactly what I was expecting.
@aawsome I'd say this was a side effect of improper testing on my part. Even though I forget the created snapshots between runs, it seems it was slow because I ran the patched version first. Maybe OS caching on our storage side (Minio) affected the reply time.
Now as a control run, I've built the upstream restic (instead of the previous release version, since I've built your branch on top of upstream) and ran this upstream version first. It shows you can disregard the elapsed times, since they depend on the order.
I tried to understand the memory consumption results for v0.9.6, 3rd backup, but I was not able to reproduce your results. May it be that you used the memory consumption of the index (102 MB) instead of the total memory (204 MB)?
Good catch - probably a copy & paste error. It's now corrected in my post above.
I also do not understand why newindex2 shows higher memory consumption than newindex for prune and restore. In the memory profile you can see that the memory used by the index is lower and the higher memory consumption is due to encoding/json.Marshal and restic.NewBlobBuffer in prune and restorer.(*packCache).get in restore (which are basically not present in the profile for v0.9.6).
So I believe that newindex2 also has lower memory consumption for these two commands than newindex, but I don't understand why the profiling shows these extra (non-index related) parts consuming memory..
TBH, I have no idea what happened there. I tried to reproduce the results on the very same VM but was unable to.
Also the memory consumption showed by time is lower for newindex2 than it is for newindex.
I've compared the results shown by time from all runs of my 4th test (see below), and they don't seem to be too reliable. For some tests the memory consumption between the CPU and memory profiling runs varies by 15-20%.
restic-orig_test4_cpu_findblob: 253,12MB
restic-orig_test4_mem_findblob: 307,44MB
restic-newindex2_test4_cpu_catblob: 184,12MB
restic-newindex2_test4_mem_catblob: 216,56MB
restic-newindex2_test4_cpu_findblob: 174,81MB
restic-newindex2_test4_mem_findblob: 202,38MB
Can it be that the machine doesn't have constant cpu power and the differences are not caused by different implementations but by different machine circumstances? When you compared v0.9.6 and newindex, the results did not vary much and as far as it looks these two runs were done at approximately the same time.
In other words: is it possible for you to rerun the CPU measurement for v0.9.6 and newindex2 at the same time? Then we could check whether variation in CPU timings is common and maybe get better insights into performance changes due to the changed implementation.
The VM I used doesn't have dedicated cores. Especially with 8 cores restic isn't even close to using all available CPU power. That's why I thought it wouldn't matter much for my tests. I've now repeated the test with dedicated CPUs and the results are much more even. Looks like that's what caused the discrepancy.
| | v0.9.6 | newindex | newindex2 | difference (v0.9.6 and newindex2) |
| :--- | ---: | ---: | ---: | ---: |
| init repo | 00:26,2 | 00:48,7 | 00:58,4 | +122,79% |
| 1st backup | 32:51,8 | 32:12,1 | 32:49,0 | -0,14% |
| 2nd backup | 08:40,2 | 08:31,2 | 08:30,3 | -1,90% |
| 3rd backup (--force) | 21:40,5 | 21:43,6 | 21:41,0 | +0,04% |
| forget | 00:00,8 | 00:00,7 | 00:00,8 | -7,98% |
| prune | 00:43,7 | 00:43,8 | 00:42,2 | -3,56% |
| check | 00:26,3 | 00:26,9 | 00:26,4 | +0,30% |
| list blobs | 00:03,5 | 00:04,4 | 00:04,6 | +29,46% |
| cat blob | 00:01,7 | 00:01,8 | 00:01,8 | +9,04% |
| find blob | 00:17,7 | 00:17,3 | 00:16,7 | -5,95% |
| restore | 12:40,2 | 13:20,3 | 12:24,1 | -2,12% |
| | v0.9.6 | newindex | newindex2 | difference (v0.9.6 and newindex2) |
| :--- | ---: | ---: | ---: | ---: |
| init repo | 32,05 | 32,02 | 32,02 | -0,09% |
| 1st backup | 242,38 | 170,70 | 161,80 | -33,25% |
| 2nd backup | 224,63 | 154,78 | 146,69 | -34,70% |
| 3rd backup (--force) | 218,38 | 159,48 | 150,07 | -31,28% |
| forget | 32,01 | 32,24 | 32,05 | +0,12% |
| prune | 183,05 | 110,84 | 113,29 | -38,11% |
| check | 195,05 | 74,28 | 70,33 | -63,94% |
| list blobs | 33,32 | 29,45 | 25,45 | -23,62% |
| cat blob | 92,94 | 49,02 | 33,13 | -64,35% |
| find blob | 111,51 | 29,62 | 25,41 | -77,21% |
| restore | 292,94 | 217,41 | 206,77 | -29,42% |
VM used:
8 vCPU (dedicated CPUs)
32GB RAM
$ restic-orig version
debug enabled
restic 0.9.6 (v0.9.6-40-gd70a4a93) compiled with go1.13.6 on linux/amd64
$ restic-newindex version
debug enabled
restic 0.9.6 (v0.9.6-43-g5db7c80f) compiled with go1.13.6 on linux/amd64
$ restic-newindex2 version
debug enabled
restic 0.9.6 (v0.9.6-44-g26ec33dd) compiled with go1.13.6 on linux/amd64
@dimejo Good work - thank you very much!
The list blob performance (and memory usage) is improved a lot after commit 908c8977df05ea88dcad86c37f39d201468446ad.
The other results look already pretty good. I'm currently looking if the index can save some memory for backup and restore by using small refactoring of these commands (as already done with check).
Then I'll work on the tests and make a PR.
@rawtaz My question about which part of this work should be separated to make a PR acceptable is still open. I'll need some directions here if the changes are too large for maintainers to handle..
I'm still convinced that the complete reimplementation of the index here is key to these good results in reducing memory consumption.
@aawsome Thanks for fixing the check command. I'll give it a try.
Regarding your optimizations for the check command, you might want to take a look at my pull request #2328 which manages to reduce the memory usage even a bit further and is much faster for repositories containing more than a few snapshots.
After looking at the diff between master and refactor-index I get the impression that nearly half of the changes are caused by moving code between repository and index. However, this is not clear when looking at the commits. Is this move actually necessary?
I'm also not really convinced about having a reference to the repository stored in the MasterIndex. The previous way of letting the Repository control the MasterIndex seems better separated in my view. Although it's probably a good idea to move the fallback from the current to the old index format into the index implementation.
I agree that it is probably not possible to reduce the total amount of changes very much without losing functionality, but it would help a lot to only have to look at a set of commits with less than a few hundred lines each.
@aawsome There's still a bug with caching treePacks. The line https://github.com/aawsome/restic/blob/908c8977df05ea88dcad86c37f39d201468446ad/internal/repository/repository.go#L97 should read treePacks.Insert(pb.PackID). This should also reduce the memory usage a bit further.
The current implementation in restic just caches packs that contain only trees. This is equivalent to the new implementation for packs created with recent versions of restic. However, in old restic versions (I think < 0.8) data and tree blobs are mixed together. These data blobs should not be cached. Therefore, I would suggest adding a second iteration over all blobs and removing packs that contain data blobs again.
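For illustration, a sketch of that two-pass filtering (the types here are assumptions, not restic's actual ones): first collect every pack that contains a tree blob, then drop packs that also contain data blobs, so mixed packs from old restic versions are not cached:

```go
package repository

// blobRecord is a simplified stand-in for an index entry.
type blobRecord struct {
	PackID string
	IsTree bool
}

// treeOnlyPacks returns the set of packs that contain tree blobs and no
// data blobs, following the two-pass approach suggested above.
func treeOnlyPacks(blobs []blobRecord) map[string]struct{} {
	packs := make(map[string]struct{})
	for _, b := range blobs {
		if b.IsTree {
			packs[b.PackID] = struct{}{}
		}
	}
	// Second pass: remove packs that also contain data blobs (mixed packs
	// written by old restic versions).
	for _, b := range blobs {
		if !b.IsTree {
			delete(packs, b.PackID)
		}
	}
	return packs
}
```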
@MichaelEischer
Thanks for finding another caching bug - I hope this really is the last one. And sorry for not being able to find it myself; so far I only used the local backend and there is no standard test which checks if caching is set up correctly... Maybe such a test should be added...
This (and the old repo format issue) is fixed in the last commit.
Also thank you very much for your other comment!
About check: I agree that there is even more memory optimization potential (and of course speed). I was already thinking about making the index able to store different user-defined flags for blobs. That way huge IDSets or user-defined maps can be avoided, as the blob ID is only stored once. This is what I had in mind for optimizing prune in a second step.
Your PR #2328 can then be easily adapted to use index flags instead of the user-defined map, and we get an additional memory reduction.
I agree with your design concern and removed repo from MasterIndex. Hence the Load and Save commands also went back to repository.go, making the diff to master a bit smaller.
However, I do not think that the main goal should be to make the merge as small as possible; I would like to implement the best design. Regarding where to locate the index implementation, we already have two places in master: internal/index and internal/repository. As the index related code is quite substantial, I thought it should be separated from repository, so I chose internal/index as the location for everything index-related (and also defined a new interface in internal/restic/index.go). All code in index_new.go, index_old_format.go and master_index.go is written from scratch, and index.go is heavily adapted while still providing the original functionality used in rebuild-index and prune.
On the other side, all index related code (except loading/saving) is removed from internal/repository.
About optimizing backup and restore:
- backup: I added the possibility to add "known" blobs to the index. As these are removed once the "known" blob is added to the index, this should save a lot of memory used in backup. I expect that the main memory usage should now be due to the space reserved for the parallel chunker, if you do not have a really HUGE repository.
- restore: I did not find any index-related optimization. It seems that all memory usage is due to the way restore works internally. I do have the feeling that there are optimization possibilities, but this should be done in separate work.

So for me, only working on the test implementation is open (if there are no other bugs that get reported). Then I will make a PR.
@aawsome Hi, sorry for taking long to reply. I feel I'm unable to give you relevant detailed directions, but what I was thinking of was to split things like "refactoring code in preparation for upcoming changes" (such as changing the index handling) and "refactoring code to optimize or add features" (such as the actual optimizations and stuff you're aiming for here) into separate commits or even PRs (depends on the code).
I'm trying to get you some eyes on this that have better things to say than I do, because I do think that your work here is looking interesting!
Also, I appreciate that you have started this contribution by opening an issue (instead of just opening a PR), thank you for that! Whenever possible it's good if we can have a discussion about suggestions before the actual implementation is done. By doing so we can find a common ground and settle on a direction that has everyone on board (and oftentimes unexpected issues or perspectives that change how the implementation would be done arise, which makes for less wasted work in the end).
@rawtaz Thank you for your reply!
You are right that during development I tried to move fast so I could try out what works and what doesn't. For now, it seems that the code change is mature enough to discuss how it can best be integrated into master.
I think it helps for the discussion to give an overview of the changes I've made and the reasons why I did each change. The implementation covers the following changes:
Major changes:
- … internal/repository and internal/index
- … internal/restic
- … (index_old.go)
- … (index_new.go)

Minor additional changes:
- … Lookup and added LookupAll
- … StorePack (EDIT: is implemented in #2773)
- … Check
- … SetKnown to handle known blobs during backup (EDIT: is implemented in #2773)
- … blob IDSet from checker.go and replace by index functionality

The major changes are IMO the key to achieving low memory consumption, good performance and a clean(er) code design (compared to the current implementation). The minor changes are quick wins that could also be independent PRs. However, if they are considered good enhancements I would prefer keeping them in the main PR; otherwise they probably need to be implemented twice or postponed.
Structure of code base

I moved most index-related functionality from internal/repository to internal/index. IMO this makes the code base structure clearer and also clearly separates which code lives in which directory (as opposed to the current implementation).
Question to maintainers: Do you acknowledge this principal change?
Structure of interface

I also separated the interface definitions in internal/restic so that it is clear what relates to Index and what to Repository. This also allows cleaner changes to the interfaces in the future.
Question to maintainers: Do you acknowledge this principal change?
Structure of MasterIndex

Currently, MasterIndex combines index-file related data structures and usually iterates over all these sub-structures for index operations. Each of the index-file related data structures can be full (meaning full enough for saving as a full index file to the repo) or finished (meaning being already present as an index file).
I completely changed this behavior. Now MasterIndex uses a main in-memory index where the majority of the index entries should be present - especially all those which are also present as index files in the repo - and a second "unfinished" data structure which is only used to save new index entries and is inserted into the main in-memory index once the contents are written as index file to the repo.
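To make the structure concrete, here is a minimal sketch of that two-tier design. All names are made up for illustration, and Go maps stand in for the actual data structures (the branch uses a sorted table for the main index and the JSON structure for the unfinished part):

```go
// Sketch only: maps stand in for the real data structures of the branch.
package index

import "sync"

type blobID [32]byte // SHA-256

type blobEntry struct {
	packIdx uint32 // reference into a shared pack-ID table
	offset  uint32
	length  uint32
}

type MasterIndex struct {
	mu         sync.RWMutex
	main       map[blobID]blobEntry // entries already saved in index files
	unfinished map[blobID]blobEntry // new entries, not yet written to the repo
}

func newMasterIndex() *MasterIndex {
	return &MasterIndex{
		main:       make(map[blobID]blobEntry),
		unfinished: make(map[blobID]blobEntry),
	}
}

// Lookup checks the small unfinished part first, then the main index.
func (m *MasterIndex) Lookup(id blobID) (blobEntry, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if e, ok := m.unfinished[id]; ok {
		return e, true
	}
	e, ok := m.main[id]
	return e, ok
}

// Finalize is called after the unfinished entries have been written as an
// index file: they move into the main index and the buffer is reset.
func (m *MasterIndex) Finalize() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, e := range m.unfinished {
		m.main[id] = e
	}
	m.unfinished = make(map[blobID]blobEntry)
}
```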
Question to maintainers: Do you acknowledge this principal change?
Index-file related data structure

Currently, there are three index-file related data structure definitions: one JSON-defined in internal/index, one JSON-defined in internal/repository and a lookup-optimized one in internal/repository.
I decided to use the JSON data structure from internal/index and added the methods needed to use it in MasterIndex (and completely removed the implementations from internal/repository). This completely removes the conversion of internal data structures when saving index files during backup. On the other hand, lookup operations need to loop over all index entries, so it is crucial that this data structure does not contain many entries during lookup, i.e. that it is merged into the main index regularly.
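For orientation, the index file format is JSON, listing packs and the blobs they contain. A rough Go mirror of that layout (a sketch following the repository design document, not the branch code) is shown below; a lookup here has to walk all packs and blobs, which is why the structure must stay small:

```go
// Rough Go mirror of the JSON index file layout (sketch, not the branch code).
package index

type indexJSON struct {
	Supersedes []string        `json:"supersedes,omitempty"`
	Packs      []indexPackJSON `json:"packs"`
}

type indexPackJSON struct {
	ID    string          `json:"id"`
	Blobs []indexBlobJSON `json:"blobs"`
}

type indexBlobJSON struct {
	ID     string `json:"id"`
	Type   string `json:"type"` // "data" or "tree"
	Offset uint   `json:"offset"`
	Length uint   `json:"length"`
}
```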
Question to maintainers: Do you acknowledge this principal change?
Support for old index file format

The support for the old index file format is now outsourced to index_old_format.go.
Question to maintainers: Do you acknowledge this principal change?
New in-memory index data structure (index_new.go)

This obviously is the biggest change. In principle, one big blob list is maintained per blob type. The list is sorted by blob ID so that lookups can be done using binary search. As the optimization targets lists with many entries, the size of each entry is minimized as much as possible: the pack ID is stored in a separate table and only a reference is saved, and offset and length are uint32 (effectively limiting the maximum pack size to 4GB). EDIT: these two optimizations are implemented in #2781.
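A minimal sketch of such a compact, sorted table with binary-search lookup (illustrative types only, not the code from index_new.go):

```go
// Sketch of a compact, sorted per-type blob table with binary-search lookup.
package index

import (
	"bytes"
	"sort"
)

type entry struct {
	id      [32]byte // blob ID (SHA-256)
	packIdx uint32   // reference into a shared pack-ID table
	offset  uint32   // offset inside the pack; uint32 limits packs to 4 GiB
	length  uint32
}

type blobTable struct {
	packIDs [][32]byte // pack IDs stored once, referenced by packIdx
	entries []entry    // kept sorted by entry.id
}

// lookup finds a blob by ID via binary search in O(log n).
func (t *blobTable) lookup(id [32]byte) (entry, bool) {
	i := sort.Search(len(t.entries), func(i int) bool {
		return bytes.Compare(t.entries[i].id[:], id[:]) >= 0
	})
	if i < len(t.entries) && t.entries[i].id == id {
		return t.entries[i], true
	}
	return entry{}, false
}
```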
A simple sorted array already avoids the memory overhead of a map, but append still causes some memory and performance overhead: extra buffer space is reserved for future appends, and data needs to be copied when the slice grows. Hence I implemented a "paged" array with a small, bounded memory overhead and no need to copy on append.
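A sketch of the paged-array idea; the page size and all names are assumptions made for illustration:

```go
// Sketch of a "paged" append-only array: fixed-size pages avoid both the
// spare capacity that append over-allocates and the copying when a single
// large slice grows.
package index

const pageSize = 1 << 16 // entries per page; an arbitrary choice for the sketch

type pagedArray[T any] struct {
	pages [][]T
	n     int
}

func (p *pagedArray[T]) append(v T) {
	if len(p.pages) == 0 || len(p.pages[len(p.pages)-1]) == pageSize {
		// start a new page; existing pages are never copied or moved
		p.pages = append(p.pages, make([]T, 0, pageSize))
	}
	last := len(p.pages) - 1
	p.pages[last] = append(p.pages[last], v)
	p.n++
}

func (p *pagedArray[T]) at(i int) T {
	return p.pages[i/pageSize][i%pageSize]
}

func (p *pagedArray[T]) len() int { return p.n }
```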
The tests that have been done indicate that this index is a bit faster than the standard MasterIndex implementation while saving more than 75% of the memory.
Question to maintainers: Does this index implementation fulfill your expectations?
Changed Lookup and added LookupAll

Lookup actually returns more than one result (but not if they are spread over several index files, BTW). In many cases, only the first result is used. Hence I changed Lookup to return only one result and added LookupAll for the cases where more than one result is needed. This makes the code a bit clearer and also saves some unneeded CPU cycles.
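Sketched with placeholder types (not restic's exact signatures), the split looks roughly like this:

```go
// Sketch of the API split; ID, BlobType and PackedBlob are placeholders.
package index

type ID [32]byte
type BlobType uint8

type PackedBlob struct {
	PackID         ID
	Offset, Length uint32
}

type Lookuper interface {
	// Lookup returns a single match, which is all most callers need.
	Lookup(id ID, t BlobType) (PackedBlob, bool)
	// LookupAll returns every known occurrence, for the few callers that
	// really need duplicates.
	LookupAll(id ID, t BlobType) []PackedBlob
}
```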
Question to maintainers: Is it ok to keep this change or would you prefer to remove it from the index PR?
Replaced Store by StorePack

Store stores only one blob in the index but is only ever called for a whole pack's contents. I therefore replaced it with StorePack, which saves the whole pack content at once. This makes the code clearer, makes storing to the JSON format much easier, and is also more performant.
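Again with placeholder types, the change roughly swaps a per-blob call for a per-pack call:

```go
// Sketch of the Store -> StorePack change (placeholder types, not the branch API).
package index

type blob struct {
	ID             [32]byte
	Type           uint8
	Offset, Length uint32
}

type storer interface {
	// before: one call per blob, even though callers always have a full pack
	Store(packID [32]byte, b blob)
}

type packStorer interface {
	// after: one call per finished pack, which maps directly onto the JSON
	// index format (a pack with its list of blobs)
	StorePack(packID [32]byte, blobs []blob)
}
```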
Question to maintainers: Is it ok to keep this change or would you prefer to remove it from the index PR?
Added CheckSetKnown to handle known blobs during backup

The tests showed that for many new files, backup uses quite some memory to track which blobs have already been added during this backup run, so that parallel savers do not save blobs twice. I moved this functionality into the index, as the "known blob" information can be dropped once the corresponding entry is added to the index. With this change only very few blobs are "known" but not (yet) in the index.
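A rough sketch of the idea; all names and the locking are illustrative, not the branch code:

```go
// Sketch: track in-flight ("known") blobs only until they reach the index.
package index

import "sync"

type knownBlobs struct {
	mu      sync.Mutex
	inIndex map[[32]byte]struct{} // stand-in for the real index lookup
	known   map[[32]byte]struct{} // blobs currently being saved
}

func newKnownBlobs() *knownBlobs {
	return &knownBlobs{
		inIndex: make(map[[32]byte]struct{}),
		known:   make(map[[32]byte]struct{}),
	}
}

// CheckSetKnown reports whether the blob is already indexed or being saved;
// if neither, it marks the blob as being saved so parallel savers skip it.
func (k *knownBlobs) CheckSetKnown(id [32]byte) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, ok := k.inIndex[id]; ok {
		return true
	}
	if _, ok := k.known[id]; ok {
		return true
	}
	k.known[id] = struct{}{}
	return false
}

// markIndexed removes the in-flight marker once the blob is in the index,
// so its memory can be reclaimed.
func (k *knownBlobs) markIndexed(id [32]byte) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.inIndex[id] = struct{}{}
	delete(k.known, id)
}
```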
Question to maintainers: Is it ok to keep this change or would you prefer to remove it from the index PR?
Removed the blob IDSet from checker.go and replaced it by index functionality

Checker builds its own in-memory index and additionally keeps all blob IDs in a separate IDSet. As all blobs are already known to the index, this can be replaced by the index's Each method.
This saves additional memory within check.
Question to maintainers: Is it ok to keep this change or would you prefer to remove it from the index PR?
Use new index functionality in 'list blobs'

Instead of reading the whole index into memory and then looping over all entries to print the blob list, each index file can be loaded separately and its contents iterated. This makes list blobs much less memory-hungry.
Question to maintainers: Is it ok to keep this change or would you prefer to remove it from the index PR?
A small test result: @aawsome, I am not able to back up into an existing repository after the latest commit.
Values from time:
|---|"Normal" backup|fix masterindex.Load() commit|last commit|
|---|---|---|---|
|Memory usage|~650Mb|~320Mb|~55Mb|
|Backup time|~12Sec|~12Sec|(gave up after 5mins)|
The storage backend reports that activity stops after querying the indexes. (Sorry if that was intended due to a format change.)
@seqizz Thanks for sending the bug report. I have to admit that I only tested the last commit with go run build.go -T which didn't show any issue... I fixed it in commit 5b4009fcd275794e3ce05304fce255d5fa3e1864
@aawsome Thanks for writing that summary comment. One of the things I, as a non-core developer (I'm just a maintainer of simple things), experience is that while core developer time is spent on other parts of the project, issues like this one just keep growing. Then whenever a core developer looks at it, it's a lot, or in some cases too much, to dig into at the time. The less there is for them to read, the higher the chances that we can make good use of their time. I took the liberty of editing the initial post in this issue to add a reference to your summary comment.
Regarding your questions, I personally cannot answer them. I will have to defer them for a while, and I kindly ask for your patience. Our main focus right now, with the available time we have, is on some other things, but I can tell you that this issue and your PR are on my radar and I am trying to get eyes on them eventually.
I think that, with your summary comment in place, the most productive thing now, besides fixing bugs (I'm slightly concerned that there might be some more small bugs lurking) and whatever else you find that doesn't deviate from what you wrote in that comment, is to let this rest a bit. The next step would be to get core developer eyes on it, and until that happens it's probably counter-productive to grow the discussion further. Let me know if my rationale here doesn't seem to make sense, so I can clarify what I mean.
I finally found time to test the performance of the backup command on an 8 TB repository with 40M files. To keep the overall runtime low I only backed up a 6 GB directory. CPU usage of restic was limited using GOMAXPROCS=1.
current master: cpu-idx-master.pdf
this pr: cpu-idx-table.pdf
With this PR restic uses around 6.5GB memory instead of 18GB for the current master, which is a gigantic improvement. The index takes around 9 minutes to load in both cases (the host does not support AES-NI so decryption is a bit slow). The backup takes 34mins for this PR and 18mins for the current master. Taking a look at the cpu profiling graphs shows the difference: The indexTable spends 16 minutes just for sorting the in-memory index. The index is large enough that a single sort call for the in-memory index takes about 75 seconds (!) to complete. During this time the whole backup processing is blocked as the index is locked exclusively.
The in-memory index probably needs a way to merge in a new index without re-sorting everything, maybe even implemented incrementally. This should be possible by sorting the new index beforehand and then merging it into the MasterIndex with a single scan over the indexTable.
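A minimal sketch of such a single-scan merge, with an illustrative entry type that only shows the blob ID:

```go
// Sketch: sort only the new entries, then merge two sorted slices in O(n+m)
// instead of re-sorting the whole table.
package index

import (
	"bytes"
	"sort"
)

type entry struct {
	id [32]byte // pack reference, offset and length omitted for brevity
}

func less(a, b entry) bool { return bytes.Compare(a.id[:], b.id[:]) < 0 }

// merge folds the (unsorted) added entries into the sorted table with one
// linear pass instead of sorting everything again.
func merge(table, added []entry) []entry {
	sort.Slice(added, func(i, j int) bool { return less(added[i], added[j]) })

	out := make([]entry, 0, len(table)+len(added))
	i, j := 0, 0
	for i < len(table) && j < len(added) {
		if less(table[i], added[j]) {
			out = append(out, table[i])
			i++
		} else {
			out = append(out, added[j])
			j++
		}
	}
	out = append(out, table[i:]...)
	out = append(out, added[j:]...)
	return out
}
```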
@MichaelEischer Thank you very much for doing your test! In fact, the current implementation needs a resort of the index table each time an index file is written. So if I calculated correctly, 13 new index files have been created, right?
The good news is that the lookups in IndexJSON (which are unsorted and thus require a linear search) do not show up at all, so this seems to be a good way to handle "unfinished" index files.
Actually, my judgement is that re-sorting an almost sorted table with sort.Sort should be close to optimal. (However, trying out sort.Stable or a pure mergesort might still be worthwhile...) I think the main issue is: if we want one sorted index array and need to insert into it, we have to move ~6.5GB of data around in memory.
So I propose to separate the index entries read from index files from those newly generated (and already saved). This makes three data structures in the MasterIndex implementation. I already changed this in my refactor-index branch. Would you mind re-testing your setup?
When I zoom into the profile data, IndexJSON takes 1.22 seconds in total, so this seems fine for now. The worst case would be a lot of small files, in which case a map might be faster.
The backup run actually created 16 new index files. The 75 seconds were a guess from a check run which only triggers re-sorting once. Both quicksort (used by sort.Sort) and mergesort have a best-case time complexity of O(n log n), thus presorted data _does not improve performance._ As a side-note the in-memory index requires less space than the on-disk representation ^^ .
I've rerun the test with the updated code and the performance is a lot better: cpu-idx-table-split.pdf
However, using a separate index for new index entries merely hides but does not solve the underlying problem: just imagine a backup run that adds a few million new files, or a large initial backup. The current implementation has O(n * n log n) complexity: the first n is the number of appends to the index, which scales roughly linearly with the amount of new backup data, and the n log n factor is the already mentioned cost of sorting the index. Replacing quicksort with a linear merge reduces this to O(n * n), which could be good enough for index sizes below several dozen gigabytes. Beyond that it won't scale, but I guess petabyte repositories are a problem of their own. The linear merge could even be implemented to work incrementally. I have an idea for reducing the complexity down to O(n * log n), but that would increase the lookup cost to O(log^2 n) and require some extensive code changes.
I wonder whether there is a data structure that would perfectly fit our needs:
- a compact in-memory representation
- lookup in O(log n) or less
- insertion in O(log n) or less; with n insertions this would then lead to a total cost of O(n * log n) for a whole backup run

When dropping the requirement for a really compact in-memory representation, hashmaps are a perfect fit, as their queries and updates happen in O(1). However, they require a lot more memory: my current prototype uses about 70% more memory than this PR.
One solution would be to combine both approaches and collect new entries in a hashmap and merge them into the sorted array from time to time (e.g. when the hashmap's size reaches 10% of the sorted array).
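A sketch of that combination, reusing the entry type and merge helper from the sketch above; the 10% threshold comes from the suggestion here, everything else is illustrative:

```go
// Sketch: new entries go into a map; once the map reaches ~10% of the sorted
// array it is folded in with a single linear merge.
type hybridIndex struct {
	sorted []entry            // densely packed, sorted by blob ID
	recent map[[32]byte]entry // new entries, O(1) lookup and insert
}

func newHybridIndex() *hybridIndex {
	return &hybridIndex{recent: make(map[[32]byte]entry)}
}

func (h *hybridIndex) store(e entry) {
	h.recent[e.id] = e
	if len(h.recent)*10 >= len(h.sorted) { // recent reached ~10% of sorted
		h.flush()
	}
}

// flush merges the map into the sorted array with one linear pass
// (see the merge sketch above) and resets the map.
func (h *hybridIndex) flush() {
	added := make([]entry, 0, len(h.recent))
	for _, e := range h.recent {
		added = append(added, e)
	}
	h.sorted = merge(h.sorted, added)
	h.recent = make(map[[32]byte]entry)
}
```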
That said, it is probably best to wait for feedback from @fd0 before making any extensive changes. (Sorry for growing the discussion again.)
> I wonder whether there is a data structure that would perfectly fit our needs:
I'd say there are "many" data structures more or less satisfying these needs, but I don't know how much effort it would be to switch to them in restic. Basically any B-tree, or tree structure in general, should satisfy these needs (hash maps are really inefficient for such scenarios, as you also found out yourself in your measurements).
If I'm not mistaken, a B-tree only guarantees that its nodes are at least half full. And it would require pointers, which would add roughly an 8-byte pointer for every 48 bytes of data. If a B-tree guarantees 70% space usage on average, then it would end up with a memory overhead similar to my hashmap-based implementation. There might be variants like B*-trees which are more efficient, though.
@michaeldorner yeah, that's the "old traditional B-tree" - as you wrote, there are many B-tree-like structures (I've seen at least ten quite different ones so far), each having different properties. Other well-known tree-like structures include e.g. red-black trees (popular for maps which must guarantee consistent insertion/deletion time & space; this is in contrast with hash-based maps, which might suffer from "jumps" or other excessive peaks in both time & space, eventually leading to severe security/performance problems). There are also various "compressed" maps - feel free to search for them. So there are really a number of options.
That said, I'm not against using a small temporary file as discussed in this topic, like e.g. sqlite does. By the way, sqlite is highly efficient; I've used it as storage purely for the sake of extremely low memory usage while still getting quite high throughput, despite using hard disks as a temporary cache. Feel free to try it out, it might surprise you.
Just my 2 cents :wink:.
The sorted, densely packed array of index entries is basically the gold standard for memory usage. Yes, you could apply some compression, but that would only help with the blob size/offset and pack-ID index (approx. 16 bytes); the remaining 32 bytes are the blob ID (a SHA-256 hash), which I'd assume to be essentially incompressible. I'm not sure whether the potential for another 20-30% lower memory usage would justify the even higher complexity.
Trees always require one pointer (plus some unused space, as with B-trees) or two pointers (red-black trees and others) per element, which would already add 8 or 16 bytes of overhead to the approx. 48 bytes of the index entry itself. I'm also not sure how well the Go garbage collector would handle millions or even billions of index entries.
To save the honor of hash tables: cuckoo hashing (constant lookup time) together with 4 entries per bucket is said to achieve over 90% space utilization. If you just use ten of these hash tables then the total memory overhead would also stay below 33% even while growing a hashmap.
I think the main question is whether we want a single data structure (tree/hashmap) with a certain memory overhead, or the lower overhead of a sorted packed array, which then needs an auxiliary data structure to provide reasonable insertion speed. If that auxiliary data structure holds just 10-20% of the overall index data, then it won't matter much whether it has 30% or 70% overhead, and in that case Go's builtin hashmap would be the simplest solution.
Using an on-disk index has its own share of problems: if the index is too large to be kept in memory, then index performance will probably implode on HDDs (SSDs should fare much better) unless the index can benefit from locality between the blobs of e.g. one folder/file (in a fully sorted list of blob IDs, the access pattern would basically be uniformly random).
@MichaelEischer thanks for the wrap-up. I'd still encourage you to try a disk cache (just to get a glimpse of the possible performance, not to be incorporated into restic) using sqlite3 with different setups (e.g. both with and without an index, with higher maximum memory limits than the very low defaults, using just one table as a key-value, map-like storage, etc.), while leveraging some sqlite3 peculiarities like the built-in integer primary key, multithreaded access without explicit locking (sqlite handles that itself), using one huge transaction for all accesses, and not using WAL, as it's significantly slower for reads (but slightly faster for writes).
Using a disk cache might also benefit quite significantly from stream-oriented processing, but I don't know whether that's a good fit for a quick tryout. Just an idea based on my good experience with sqlite3-based key-value storage :wink:.
@aawsome Could you split your "list blobs" optimization into a separate PR?
I don't plan to merge the blob IDSet removal from checker.go as #2328 also provides that optimization but without having to involve the master index.
That leaves LookupAll as the last minor change. That function would be useful when adding support for damaged repositories to prune. However, without some index performance optimizations first, it will probably just reduce the overall performance of restic. So this will have to wait.
Regarding the major changes, these also have to stay on hold for now.
@dumblob The problem with using sqlite or any other generic database for an on-disk index is that, without optimizations that exploit locality between the blobs of a file (i.e. blobs within a file end up in nearby packs and are usually accessed together), we'll end up with pseudo-random access to all parts of an index that is larger than main memory and therefore have to load data from random locations on disk. That is, a generic database implementation has no chance at all of providing good performance.
> @aawsome Could you split your "list blobs" optimization into a separate PR?
Sure. PR will come soon.
> That leaves LookupAll as the last minor change. That function would be useful when adding support for damaged repositories to prune. However, without some index performance optimizations first, it will probably just reduce the overall performance of restic. So this will have to wait.
In fact, the current implementation of Lookup is partly a LookupAll in the sense that it returns all results within one index file, but only the results of the first index file that has a match.
As the usual case is that there are no duplicates within one index file, this effectively does not return duplicates even in cases where they might be needed.
I will prepare a PR to change Lookup and add LookupAll where it is useful. Then we can discuss if this is helpful or not.