DVC: perf: remote indexes

Created on 18 Jun 2019 · 11 comments · Source: iterative/dvc

As we discussed, making dvc fast should be a high priority, since poor performance can easily drive people away. A big part of today's slowness is working with remotes, which almost always involves collecting file statuses, and that can be slow for bigger remotes. All this leads to some form of indexes.

However, our remotes don't provide the luxury of atomic group writes, reads, or read-modify-write operations. We can still use the following strategy:

  • make an index dir on the remote, say index,
  • write a list of files (with names, checksums, mtimes and/or file sizes) to an index file in that dir, say 1.idx,
  • later, when some client needs to update it, e.g. after pushing some new files, it:

    • reads the current index,

    • updates it,

    • writes the new list to 2.<uuid>.idx,

    • removes 1.idx.

      This way, in case of a race, we will have several index files.

  • if a client needs to read the index, it downloads all index files and combines them.

Since we will not only have adds but also deletes, we will need a smart combine procedure, like in CRDTs.

The file format is to be discussed; simple JSON or gzipped JSON with a list of files may do the job, though.
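A minimal sketch of the read/combine/update cycle, assuming plain JSON and a hypothetical remote object with list_files/download/upload/remove methods (not dvc's actual remote API):

```python
import json
import uuid


def _generation(name):
    # index files are named <generation>.<uuid>.idx
    return int(name.rsplit("/", 1)[-1].split(".", 1)[0])


def read_index(remote):
    # Download every index file and merge them in generation order,
    # so entries from newer generations win.
    merged = {}
    for name in sorted(remote.list_files("index/"), key=_generation):
        merged.update(json.loads(remote.download(name)))
    return merged


def update_index(remote, new_files):
    # new_files: {filename: {"checksum": ..., "mtime": ..., "size": ...}}
    old_names = list(remote.list_files("index/"))
    index = read_index(remote)
    index.update(new_files)
    generation = 1 + max((_generation(n) for n in old_names), default=0)
    # The <uuid> part keeps concurrent writers from clobbering each other;
    # a race simply leaves several index files for readers to combine.
    remote.upload(f"index/{generation}.{uuid.uuid4()}.idx", json.dumps(index))
    for name in old_names:
        remote.remove(name)
```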

What do you guys think? @shcheklein @dmpetrov @efiop @pared @mroutis

enhancement feature request p1-important performance question research


All 11 comments

First, the ideas we discussed before (just to document them - they might spark more experimentation or help focus on a smaller problem):

  1. Use the hierarchical structure of the DVC cache as an index. Right now we split into up to 256 directories, since we use only the first two letters of a checksum, but we can (and probably should anyway, to keep directories smaller) introduce at least a couple more layers. That gives us 2^24 potential subdirectories. To get status we can run multiple parallel ls calls (see the sketch after this list).

  2. Use an index/flag file for the directory. The biggest performance bottleneck in our case happens when we have large directories. Locally, we have done some optimizations (at least we hope they will work) to get the complexity of certain operations down to O(1). (Namely, we reconstruct, using hardlinks, a full "shadow" copy of the directory inside the cache dir. It makes certain operations behave as if they were working with a single file.)
    Can we do the same trick with a remote? Push the json file we have for a directory to the remote, or create a special flag file whose presence would mean that all the files exist? We would potentially need to use some data structures like @Suor mentioned.
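A rough sketch of idea 1, assuming a hypothetical remote object with a list_prefix(prefix) call:

```python
from concurrent.futures import ThreadPoolExecutor

# Two hex characters per level; three levels give 16**6 = 2**24 leaves,
# but listing just the 256 top-level prefixes in parallel already helps.
PREFIXES = [f"{i:02x}" for i in range(256)]


def list_cache(remote, jobs=32):
    # Many small listings running in parallel instead of one huge
    # sequential listing of the whole cache.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        listings = list(pool.map(remote.list_prefix, PREFIXES))
    return {checksum for listing in listings for checksum in listing}
```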

Let me know if something above requires some clarifications :)

Back to the idea of using lock-free, self-healing structures. It's hard (for me) to think this all through at once; this stuff can be tricky. Can we first discuss and decide on some "data remote model" - what guarantees do we want to provide, what things are okay to happen, which of them are not, etc.?

To start, stuff that comes to my mind:

  1. It's ok to push things twice from time to time: one client does not see an object on the remote and pushes it again.
  2. It's not ok to miss a push: one client sees an object on the remote which is not actually there. That might cause data loss, which is not acceptable.

Any other scenarios that come to your mind that are not acceptable?

If you think about it, our .dir cache files are indexes already. If a push was successful, we can pretty safely assume that if a .dir file is on the remote, then all the files from it are there as well. What we could do is:

  • On push: push the files from the dir first and only then push the .dir cache file, so anyone trying to pull/status would not see that dir on the remote until it is fully there.

  • On pull/fetch/status: check if the .dir file exists on the remote and, if it does, assume that all the files in it also exist on the remote. This would basically turn N file exists checks into 1 .dir exists check, which would be much, much faster. Then we would try to download those files, and if any are missing, dvc would report the directory as incomplete (it does that already, since we have similar problems even right now).

It is pretty clear that this might be dangerous if anyone is running gc -c, but it has been that way anyway :slightly_frowning_face:

This approach is pretty simple to implement, and it is similar to how we trust the remote that an md5-named file is not corrupted there yet verify it when we download it, same as we do with directories.
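A rough sketch of that shortcut (the remote calls are illustrative placeholders, not dvc's actual remote API):

```python
def missing_from_remote(remote, dir_checksum, file_checksums):
    # The .dir file is pushed last, so its presence implies that the
    # files it lists were pushed before it.
    if remote.exists(dir_checksum):  # dir checksums carry the .dir suffix
        return set()
    # Fall back to per-file checks; pull still verifies checksums and
    # reports the directory as incomplete if anything turns out missing.
    return {c for c in file_checksums if not remote.exists(c)}
```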

A recap of related discussions with @dmpetrov and @efiop.

Index size might be an issue. Given 10 million items at 50 bytes per item (checksum + file size), we get a 0.5GB index, which means significant time just to download it and might undermine the whole idea. However:

  • listings are an order of magnitude faster than batch exists calls when we are querying for lots of checksums, and downloading an index should be even faster,
  • downloading a file shows a progress bar, unlike listing, so the user won't be in a stuck-and-have-no-idea-if-it-is-still-working situation,
  • there would be no single index file but a series of index update files; we can cache previously downloaded/uploaded index update files, so the user won't need to download them each time,
  • we will still need to load the entire index into memory to efficiently check millions of checksums against it.

The discussion with @efiop was mostly about optimizing directories. This includes efficiently tracking changes and comparing a changed directory to the local and remote caches. The thing is, solving this would make other optimizations like indexes not really needed - there is a use case of adding many files recursively, but according to @efiop that is about thousands, not millions, of files, and millions of .dvc files would break dvc anyway, since we build the graph each time (this was never checked, though).

The core idea of optimizing a directory is not losing the link to its previous listing (which we already have) and comparing it to the actual file list. Having that, we can calculate a diff and then check whether the remote has that directory listing and the new/modified files - no need to check all the files. This implies that nobody removes files before the directory listing, which is probably how it works now anyway. To find a previous listing we might use name or similarity heuristics against existing listings, such as the number of files, total file size, file types (extensions), maybe checksums of some files. To efficiently find the listing closest to the directory we are working with, we would need to store metadata files with those heuristics for all the directories, and their history to some level. This solution is less universal than indexes and looks harder to implement; it would, however, yield better performance for the large-directory case.
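A minimal sketch of the diff step, assuming listings are {relpath: checksum} dicts like the ones we already keep in .dir files:

```python
def dir_diff(old_listing, new_listing):
    # Only added or changed files need an exists check or a push;
    # everything else is covered by the previous listing.
    added_or_changed = {
        path: checksum
        for path, checksum in new_listing.items()
        if old_listing.get(path) != checksum
    }
    deleted = set(old_listing) - set(new_listing)
    return added_or_changed, deleted
```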

We also discussed with @shcheklein somehow using the cache structure as an index. It didn't look fruitful at the time, but now I see a strategy around it that could work:

  • modify our exists call to list the whole prefix and cache the results,
  • use the cached list for subsequent exists calls with the same prefix.

This is a hybrid between the listing and batch exists strategies we have now. By changing the length of the prefix we can balance the time taken by a single listing against the number of listings. For a small number of checksums we will execute only the same number of fast listings; for a large number of files we will simply list all the prefixes. The 256 prefixes we have now might not be a good choice, so the cache structure would need to change. Overall this looks inferior to indexes.
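A rough sketch of the hybrid, with the prefix length as the balancing knob (list_prefix is a hypothetical remote call):

```python
class PrefixCache:
    # exists() backed by per-prefix listings, cached after first use.
    # A longer prefix_len means more, but smaller, listings.
    def __init__(self, remote, prefix_len=4):
        self.remote = remote
        self.prefix_len = prefix_len
        self._cache = {}  # prefix -> set of checksums under it

    def exists(self, checksum):
        prefix = checksum[: self.prefix_len]
        if prefix not in self._cache:
            self._cache[prefix] = set(self.remote.list_prefix(prefix))
        return checksum in self._cache[prefix]
```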

I also explored rclone, as @mroutis suggested. Here are my findings:

  • rclone does not use any type of index; it simply lists dirs/prefixes and checks checksums to know what is there,
  • it is slower than dvc by default (it uses 4 transfer and 8 checker routines),
  • it can be made faster by increasing those numbers.

Here is a table of test runs against S3:

100k 1kb files

| | push | check |
|-------------------|---------:|--------|
| dvc | 102 min | 35 min |
| rclone | ~220 min | 62 min |
| rclone --opts* 16 | 66 min | |
| rclone --opts* 32 | 33 min | 16 min |

*--opts N means --checkers N --transfers N --max-backlog 100000

100 1mb files (same total size)

| | push | check |
|--------|-------:|---------|
| dvc | 23 sec | 8.8 sec |
| rclone | 28 sec | 7.1 sec |

50 2mb files (same total size)

| | push | check |
|--------|-------:|---------|
| dvc | 25 sec | 7.5 sec |
| rclone | 26 sec | 5.1 sec |

Since we can increase the number of jobs in dvc too, and we don't actually need to re-check checksums for cache files - they are in the filenames - rclone can't be used to speed dvc up. It could theoretically be used to save some remote-handling code and to support more remotes, but we would additionally lose time on process startup and make installation more complicated. Anyway, indexes are still needed, so using rclone - probably as an additional remote - is a separate question.

Another takeaway from these runs is that we can significantly speed up push/pull/status operations by joining small files. It makes sense to join files into chunks of around 1 MB for transfer, and even bigger for checks if there are still lots of them. This will require sharding directory files into groups of predictable or constant size, which will require directory listings to work properly (any functional partitioning I am aware of will either have non-constant shard size or break on particular small changes - comment below if you know better). Directory listings are discussed above, in the optimizing-directories paragraph. And again, this is mostly beneficial for files smaller than 1 MB.

So, the plan. It is the original one with some adjustments:

  • there is an indexes/ dir beside the cache,
  • it contains index files named <generation>.<md5>.<file-size>.<format-version>.<ext>,
  • the contents of each file is a {filename: size} dictionary serialized with either json or msgpack, possibly gzipped (a serialization sketch follows the comparison below).

Here is a comparison for indexes in those formats containing 1 million keys:

| | file size | pack and dump | load and unpack |
|----------------|----------:|--------------:|----------------:|
| json | 46M | 1.4 sec | 0.78 sec |
| json.gz | 26M | 6.9 sec | 1.15 sec |
| msgpack | 39M | 0.2 sec | 0.39 sec |
| msgpack.gz | 25M | 2.2 sec | 0.71 sec |
| bin.msgpack | 22M | 0.13 sec | 0.36 sec |
| bin.msgpack.gz | 22M | 1.2 sec | 0.47 sec |

Other considerations:

  • msgpack does not currently have wheels for OS X,
  • bin msgpack stores checksums as byte strings rather than file names; this makes the indexing code aware of the checksum-to-filename transformation, so such indexes will only work for the cache.
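To make the naming scheme and serialization concrete, here is a minimal sketch; dump_index/load_index are illustrative helpers, not existing dvc code:

```python
import gzip
import hashlib
import io

import msgpack  # note: no OS X wheels at the moment, see above

FORMAT_VERSION = 1


def _gzip_deterministic(data):
    # Fix mtime=0 so identical contents always compress to identical
    # bytes; the reindexing section below relies on an index file name
    # being determined only by its generation and contents.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as f:
        f.write(data)
    return buf.getvalue()


def dump_index(index, generation):
    # index is a {filename: size} dict.
    payload = _gzip_deterministic(msgpack.packb(index))
    md5 = hashlib.md5(payload).hexdigest()
    name = f"{generation}.{md5}.{len(payload)}.{FORMAT_VERSION}.msgpack.gz"
    return name, payload


def load_index(payload):
    return msgpack.unpackb(gzip.decompress(payload), raw=False)
```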

Creating indexes

Most index files will be created on push. The sequence is:

  • download all indexes (done as part of the status op anyway),
  • calculate the new files,
  • push the files as usual,
  • add a new index file listing all the new files. The new index's generation is the max of existing generations + 1.

If several clients push concurrently, we will end up with several indexes in the same generation. Such conflicts are resolved on read; more on this below.
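A rough sketch of the push sequence, reusing dump_index from above; fetch_and_merge_indexes and the remote methods are hypothetical placeholders:

```python
def push(remote, local_files):
    # local_files: {filename: size} for everything we want on the remote.
    index, max_gen = fetch_and_merge_indexes(remote)  # done for status anyway
    new_files = {name: size for name, size in local_files.items()
                 if name not in index}
    for name in new_files:
        remote.upload_file(name)  # push files as usual
    # The index lands last, so nobody sees the new files until the push
    # has completed successfully.
    idx_name, payload = dump_index(new_files, max_gen + 1)
    remote.upload("indexes/" + idx_name, payload)
```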

Using indexes

Instead of running a listing or batch exists operation as part of a push, pull, or status operation, we download all the indexes, load them into a ChainMap, and check against that.

We cache previously downloaded or uploaded indexes for the remote, so we don't have to download all of them each time. So in fact this is an index listing plus a download of the absent ones, not a full download.
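A minimal sketch of the read path with ChainMap; newest generations go first, so newer entries (including the {filename: None} deletion markers described under garbage collection below) shadow older ones:

```python
from collections import ChainMap


def combine_indexes(index_dicts_newest_first):
    # ChainMap looks keys up in order, so an entry from a newer
    # generation wins over older ones without copying any dicts.
    return ChainMap(*index_dicts_newest_first)


def exists_on_remote(combined, filename):
    # A None value is a deletion marker, i.e. the file is gone.
    return combined.get(filename) is not None
```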

An important notice: until the index lands, nobody sees the new files. That is ok, since the user will see that the push failed. However, it prevents us from resuming an interrupted push. To get around this we can split particularly big pushes into chunks.

One might think of using some file tree structure to enable efficient exists calls without loading all indexes into memory, by selectively reading files on disk instead. This should not be a part of the initial implementation though, since in the case of a big batch exists call - the one we are trying to solve here - the best approach is to load the index into memory anyway.

Garbage collection

This will work in the following sequence:

  • list remote files using the indexes,
  • calculate the files to remove,
  • upload a deletion index in the form of a {filename: None} dict,
  • remove the listed files, but only those added before the deletion index.

Once the deletion index is there, the files become invisible to everyone.
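A sketch of that sequence, reusing the hypothetical helpers from the push sketch above:

```python
def gc(remote, used_filenames):
    index, max_gen = fetch_and_merge_indexes(remote)  # list via indexes
    to_remove = [name for name, size in index.items()
                 if size is not None and name not in used_filenames]
    # Upload the deletion index first: from this moment the files are
    # invisible to readers, even though they are still on the remote.
    idx_name, payload = dump_index({name: None for name in to_remove},
                                   max_gen + 1)
    remote.upload("indexes/" + idx_name, payload)
    for name in to_remove:
        remote.remove(name)  # only files added before the deletion index
```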

Conflicts

If two clients push concurrently we will get two indexes in the same generation. However, since our files are identified by checksum and never modified, any overlapping files in concurrent indexes are identical, so anyone reading such indexes may just merge the dicts.

If we have concurrent push and gc then:

  1. If the push didn't see the deletion index, it won't add any files that are being deleted, since those are already on the remote.
  2. If the push saw the deletion index, all the pushed files will be newer than it and won't be garbage collected.

As a result of scenario 1 we might still get concurrent push and deletion indexes. However, in scenario 1 the indexes have no common keys - gc only removes previously listed keys and push only adds previously absent ones - so there is no conflict.

Reindexing

There are two types of reindexing. The first is simply merging all the indexes from generation 1 up to some generation, writing the result as a new index of that generation, and removing all the joined ones. This could be done to lower the index count or to pack a push/gc sequence. There are no issues if several clients do this concurrently, since one will just override the other - an index file name is determined only by its generation and contents.

The second type of reindexing is relisting all the files and building an index from scratch. This is required at the start, and whenever the index or cache breaks in some way. The result is added as a generation + 1 index, making all the previous ones irrelevant.
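A rough sketch of the first type of reindexing (compaction); all names are illustrative, and dump_index/load_index come from the serialization sketch above:

```python
def generation_of(name):
    # index files are named <generation>.<md5>.<file-size>.<version>.<ext>
    return int(name.rsplit("/", 1)[-1].split(".", 1)[0])


def compact(remote, up_to):
    # Merge generations 1..up_to into a single index of generation
    # up_to and drop the merged files.
    names = [n for n in remote.list_files("indexes/")
             if generation_of(n) <= up_to]
    merged = {}
    for n in sorted(names, key=generation_of):
        merged.update(load_index(remote.download(n)))
    # Deletion markers only shadowed generations we are about to delete,
    # so they can be dropped from the merged result.
    merged = {name: size for name, size in merged.items() if size is not None}
    idx_name, payload = dump_index(merged, up_to)
    # Deterministic name: concurrent compactions of the same inputs
    # produce the same file, so one simply overrides the other.
    remote.upload("indexes/" + idx_name, payload)
    for n in names:
        if n != "indexes/" + idx_name:
            remote.remove(n)
```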

A note on garbage collection.

Since git is distributed in nature and we collect everything that is not referenced, we may remove something recently pushed that is only referenced in git commits not available locally yet - not pulled, or not even pushed yet by the author.

The sane way to circumvent this is to provide a grace period: do not gc anything newer than N days. In case someone has just pushed something they shouldn't have and wants to remove it, that should still be possible:

dvc gc -c --grace-period=0

This is "I know what I am doing even thoufgh I just messed up flag" :)

The grace period might also simplify some of the conflict resolution above. E.g. we might use a normal listing in gc instead of the index listing to collect orphaned files. This should be discussed in #2325.
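A minimal sketch of the grace-period filter, assuming index entries carry an mtime (mtimes were mentioned as index material in the original proposal) and a hypothetical default of 7 days:

```python
import time

DAY = 24 * 60 * 60


def gc_candidates(index, used_filenames, grace_period_days=7, now=None):
    # Skip anything newer than the grace period, so we never collect
    # files referenced only by commits we haven't pulled yet.
    # --grace-period=0 disables the protection.
    now = time.time() if now is None else now
    cutoff = now - grace_period_days * DAY
    return [name for name, meta in index.items()
            if meta is not None
            and name not in used_filenames
            and meta["mtime"] < cutoff]
```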

@Suor I don't have any additional comments on your proposal.
As for the data storage format, have you considered bcolz? Maybe we could make use of it. Though, I remember that I had problems with the installation the first time; I don't know what the situation is now.

@Suor good point about the grace period. This assumption gives us the ability to relax the consistency model and not to think about scenarios where GC runs in parallel (which is the root cause of the inconsistency).

This is the best solution I've seen so far. I'm trying to find downsides and limitations of this approach to better understand the risks. I'd appreciate it if you could help me with that.

  1. Introducing a grace period seems like a great way of localizing issues to the border of the grace period and to the cases when someone runs --grace-period=0. This localization significantly reduces the probability of corruption/inconsistency, but it does not completely get rid of it. Please correct me if I missed something.

  2. With the grace period, GC logic will be more complicated for users. You need to remember this grace period stuff all the time. This might be fine; an analogy might be the JVM, where releasing/freeing memory and the actual collection happen at different times.
    2.1. Is it ok to introduce this additional complexity to users?
    2.2. Should we mark a file as needs-to-be-collected if the grace period has not expired yet?

@dmpetrov Without a grace period it would be hard to ever run gc automatically, since there could be files that are not referenced by the repo on the machine where gc runs, but are referenced from some new commits.

So we'd force our users to think about this instead of thinking about a grace period. It could be an ok solution; people would need to adjust their strategies to fit it and live with it, e.g. never push+remove local files until they reach some age - a grace period from the other point of view.

However, a grace period makes little sense when we think about the local cache, but this goes back to the point that mixing local and remote gc in a single command with the same syntax and semantics is dangerous, which I mentioned in the gc discussion.
