Dvc: optimize dir checkout

Created on 12 Sep 2019  路  31Comments  路  Source: iterative/dvc

c21-full-week enhancement p1-important performance research

Most helpful comment

Discussed this with @efiop and @pared. Came to the agreement that to make it fast and decrease complexity we will:

  1. Make dvc checkout not bother with links types (basically revert to the old behavior)
  2. Provide dvc checkout --relink for a user to manually force relinking.
  3. Provide a HINT after cache type change.

All 31 comments

If we include mode and link type into dir checksum then we will be able to still use dir.changed to shortcut everything. The question is how much slower that will dir checksum make.

Side consideration. We should really add some performance metrics, this becomes less and less manageable to track.

@Suor if we would include that, then things like dvc status will start reporting those files as changed, which is not really desirable, since for status/repro we don't really care if the file is of correct type. We only care about it on re-adding and checkout, really. But we could definitely look in that direction to optimize it along the way.

Side consideration. We should really add some performance metrics, this becomes less and less manageable to track.

Agreed, we've been talking about setting that up on, for example , dvc-test repository, but just didn't have the time to deal with it yet, unfortunately :slightly_frowning_face:

I think bench dir here would be easier to maintain then some separate repo.

For the record, discussed dir vs dvc-test approaches and didn't reach the consensus. To me, it makes sense to have those in dvc-test where they will actually be run regularly and their development won't create noise on the main core repo.

Need to do the research, figure out what is slowing us down and what are the options for us.

Seems that this issue might evolve into few smaller ones.
In my opinion, we could start by trying to shortcut the directory check described in original comment For checking out unchanged directory.

In #2488 we dropped shallow dir checksum check, which caused us to mark dirs that might have changed access level or link type as unchanged.

I think that we could implement kind of shortcut for current dir checkout, at least in case of "hardlink" and "symlink".
In those cases, we could simply store directory mtime and size in state db, retrieve it and compare it with existing upon checkout. In case of access change (dvc unprotect data_dir/file) link type of file will change, so will mtime and size.

Regretfully this case is not as easy for "copy" and "reflink", since unprotecting will result only in access rights change which is not reflected in get_mtime_and_size.

In the latter cases, it does not seem to me that we would be able to check whether any file did have its file access level changed without checking each one of them.

I also did performance tests for the case described in the aforementioned comment, respectively for each link type. Interestingly, adding 100K files, remove 1 and checkout for symlink cache type seems to take twice as much time as hardlink or copy, need to investigate why.

To keep history: we (with @efiop) discussed removing write permissions from directory upon dvc add dir in case of protected repo. This idea seems ok, for checks whether directory was modified, though with the current unprotect implementation it has a big drawback that cannot be ignored:

  • if someone would like to add some files to her/his dataset, one would need to unprotect the whole dir, resulting in copying all files inside
    For now, this idea is dropped.

One thing that could be optimized is reintroducing self.changed check that was removed here
But to make sure we also check link type changes and write permision changes, we should also gather mode for all files in dir and verify it is consistent with self.protected. Other thing to check would be that mtime did not changed for dir (if checksum and permission did not changed, changed mtime indicates that we are using *link cache type and that we did unprotect one or more files).

Reintroducing this check in new form would make sure that dir checkout for unchanged dir is faster.

Another one (thanks to @efiop) : introduce {X}PoolExecutor to remote.base._checkout_dir.

Ok, little summary:

  • We can try reintroducing check for unchanged directory, but it would need to be performed altogether with check that file modes are consistent with protected (could by done by getting st_mode during get_mtime_and_size) and that no file has been unprotected (mtime check should do that).
  • We can introduce Thread or Process PoolExecutor to remote.base._checkout_dir. Though it is not straightforward since _checkout_dir uses state db.

For the record, probably need to remove unpacked dir now.

@efiop status/push/pull/commit will become very slow again?

@shcheklein it is not used right now, as we are checking everything one-by-one anyways. Need to optimize in another way, this is what this ticket is about.

@efiop just to confirm are we doing this one-by-one check during the same "status" operation? (which affects push/pull/status, etc)?

@shcheklein Hm, on further inspection, I might be wrong. We are not using it on checkout, but we do seem to use it on other operations. So checkout after checkout is now slower than it was before.

@efiop not sure I got it. it - new check or unpacked dir? checkout after checkout?

How about shifting priorities here? Like make fast matter more than dedup and do not check link types on dir checkout? Basically what we had before. This is an easy change, all our performance tricks will work again. And if a user needs to dedup one might do that manually with:

rm -rf dir
dvc checkout dir.dvc

This is what I would prefer dvc doing unless we can somehow get both performance and dedup without significant code complication.

more than dedup

what kind of dedup are we talking about? Just

I would prefer dvc doing unless we can somehow get both performance and dedup without significant code complication.

sounds about right to me either

The reason for checkout becoming slower is that we now care about link types more, so if there is a copy of a file and we should have symlink then we remove it and make a symlink. Checking the link types of all the files in the directory prevents us from using fast directory changed check (calc its mtime and size, get checksum from sqlite and compare it to an expected one).

Link checks are there for a reason. We need to have both fast and granular. If you remember, unpacked dir was a quick fix, there is probably a proper solution too.

I can't see other reason then dedup for link type checks. Are there one?

@efiop can we indeed summarize the checks we have and the reasons behind them? Link checks are there for a reason. is not very meaningful or actionable answers, tbh.

@Suor @shcheklein

1) If old link type doesn't match new one, we need to replace old link with a new one;
2) If repo used to be unprotected but now it is - we need to protect the link;
3) If you have unprotected particular file, checkout should replace that copy with a proper link;

It all was in issues from the recent past.

@efiop cool! can we summarize now when do we have to run those checks? For example, it's not obvious why do we run it when we do dvc commit and especially if dir has been just added (so, not changes to it).

@shcheklein Sorry for short and snappy responses, but this needs to be properly investigated. For commit, it happens here https://github.com/iterative/dvc/blob/0.66.5/dvc/remote/base.py#L483 , where checkout looks at links and if they are changed - re-links them.

As @Suor suggested in the past, the solution here would be to quickly collect fs info in a bulk and then analyze it, which would make it much more efficient than it is right now and much easier to parallelize if needed.

Discussed this with @efiop and @pared. Came to the agreement that to make it fast and decrease complexity we will:

  1. Make dvc checkout not bother with links types (basically revert to the old behavior)
  2. Provide dvc checkout --relink for a user to manually force relinking.
  3. Provide a HINT after cache type change.

There is still a relink part.

@Suor Could you elaborate?

@efiop in the comment above points 2 and 3 are not implemented yet.

@Suor So, should we keep this open? If so, please reopen.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mdscruggs picture mdscruggs  路  3Comments

ghost picture ghost  路  3Comments

anotherbugmaster picture anotherbugmaster  路  3Comments

dmpetrov picture dmpetrov  路  3Comments

tc-ying picture tc-ying  路  3Comments