From https://github.com/iterative/dvc/pull/4518#discussion_r482763844
When the local workspace is dirty, the existing dir cache and hashes are reused for dir outputs in DvcTree (and RepoTree by extension). If the output is dirty (new/modified/removed files in the directory), get_hash()/get_dir_hash() should re-compute the hash rather than using the existing (clean) hash from cache/state.
Once DvcTree is updated to handle a dirty workspace in this scenario, dvc diff should be updated to use RepoTree.get_hash() everywhere.
Regarding "should re-compute the hash rather than using the existing (clean) hash from cache/state": just to clarify, state is for dirty repos too. We should use it to avoid recomputing the hash multiple times for the same dirty file. That should give us pretty good performance even for dirty repos.
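A minimal sketch of that idea, not DVC's actual state implementation: the HashState class, its in-memory dict, and the (mtime, size) signature are all assumptions made for illustration. The hash is re-computed only when the file's signature changes, so a dirty file is hashed once per modification rather than on every lookup.

```python
import hashlib
import os


class HashState:
    """Hypothetical stand-in for a state store (not DVC's real class).

    Remembers the hash of each file together with the (mtime, size)
    signature the file had when it was hashed.
    """

    def __init__(self):
        self._entries = {}  # path -> ((mtime_ns, size), md5 hex digest)
        self.computed = 0   # number of times file contents were actually hashed

    def get_hash(self, path):
        st = os.stat(path)
        signature = (st.st_mtime_ns, st.st_size)

        cached = self._entries.get(path)
        if cached is not None and cached[0] == signature:
            # File unchanged since the last hash: reuse it, clean or dirty.
            return cached[1]

        # New or modified file: re-compute and remember the result.
        with open(path, "rb") as fobj:
            digest = hashlib.md5(fobj.read()).hexdigest()
        self.computed += 1
        self._entries[path] = (signature, digest)
        return digest
```

Calling get_hash() twice on the same modified file hashes its contents only once; a further edit that changes the size or mtime triggers exactly one more computation.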
Related bug: DvcTree.walk in a dirty workspace still yields all file names in the original dir cache for the directory out, even if a nested file has been deleted in the local workspace. This makes diff ignore deleted files inside an output dir. Even though diff can see that the dir has been modified, it can't see which individual files have been removed.
example-get-started git:master py:dvc ❯ dvc diff
example-get-started git:master py:dvc ❯ rm data/features/test.pkl
example-get-started git:master py:dvc ❯ dvc diff
Modified:
data/features/
files summary: 0 added, 0 deleted, 0 modified, 0 not in cache
Resolving the original issue and re-computing the dir cache for the dirty workspace will fix this bug without needing to modify DvcTree.walk, since it should then be yielding filenames from the updated (dirty) dir cache file list.
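To illustrate why a re-computed file list is enough, here is a rough sketch that does not use DVC's real dir cache format: build_dir_listing and diff_listings are hypothetical helpers, and the listing is assumed to be a plain {relative path: md5} dict. Comparing the listing taken before the deletion with one rebuilt from the dirty workspace surfaces the removed file directly.

```python
import hashlib
import os


def file_md5(path):
    """Return the md5 hex digest of a file's contents."""
    with open(path, "rb") as fobj:
        return hashlib.md5(fobj.read()).hexdigest()


def build_dir_listing(root):
    """Hypothetical helper: {relative path: md5} for files currently on disk."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            listing[rel] = file_md5(full)
    return listing


def diff_listings(old, new):
    """Compare a clean listing with one re-computed from the dirty workspace."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "deleted": sorted(set(old) - set(new)),
        "modified": sorted(k for k in common if old[k] != new[k]),
    }
```

With a listing captured for data/features/ before `rm data/features/test.pkl` and another built afterwards, diff_listings would report test.pkl under "deleted", which is the per-file detail missing from the diff output above.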