Dvc: status: granular output for directories

Created on 24 Jun 2019 · 10 Comments · Source: iterative/dvc

Imagine dvc tracking a directory full of images and files containing labels.

If I delete/change/update a label or image, dvc will tell me that something in the directory has changed, but not exactly what. It would be nice if dvc status could be made more granular, returning something like

data.dvc:
    outputs:
        data/foo: new
        data/bar: deleted

instead of

data.dvc:
    changed outs:
        modified:           data

as it is today.

Thanks:)

feature request p2-medium

All 10 comments

Entry points where to start looking at this:

1) Here is where the status is being created https://github.com/iterative/dvc/blob/0.41.3/dvc/output/base.py#L178
2) Need to make Output.changed_checksum() either more granular, so that it reports the particular files in the directory that have changed, or create a new method that returns a status dict for a directory.

We should be careful with this: imagine a directory with 1M new files. We probably don't want to show all of them. I would say we should show a summary by default, or at least above some threshold (on the number of changed files per directory).

  • [ ] Once this is implemented (or in parallel), I recommend we deprecate the -R/--recursive option, as dvc status seems to always be recursive (to a certain limit?) already. Just being able to give any dir as target would suffice (full context: https://github.com/iterative/dvc/issues/4191).

I agree with @shcheklein. We could probably set some small default threshold of files to show (like 5, to not pollute the terminal) and add an option to show all changes.
In the default dir status case I wouldn't display it as a file -> status mapping, but rather as a status -> list of files with that status, just to save some space. With an explicit show-all-changes option, a file -> status mapping makes more sense to me, as one probably does not want to scroll through the terminal to see the status of a particular file.
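
To make the threshold idea concrete, here is a rough sketch of grouping per-file statuses into a status -> list of files view and truncating each list. The function name `summarize_status`, the `threshold` parameter and the output layout are all made up for illustration; none of this is existing dvc code.

```python
from collections import defaultdict


def summarize_status(file_status, threshold=5):
    """Group a {path: status} mapping by status and cap each list.

    Hypothetical helper: at most `threshold` paths are listed per status,
    and the remainder is collapsed into a single "... and N more" line.
    """
    grouped = defaultdict(list)
    for path, status in sorted(file_status.items()):
        grouped[status].append(path)

    lines = []
    for status in sorted(grouped):
        paths = grouped[status]
        lines.append(f"{status} ({len(paths)}):")
        for path in paths[:threshold]:
            lines.append(f"    {path}")
        hidden = len(paths) - threshold
        if hidden > 0:
            lines.append(f"    ... and {hidden} more")
    return lines


# Example input: seven new images and one deleted file.
example = {f"data/img{i}.png": "new" for i in range(7)}
example["data/bar"] = "deleted"
print("\n".join(summarize_status(example, threshold=5)))
```

A show-all-changes option would then just skip the grouping and print the file -> status mapping directly.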

We could probably set some small default threshold of files to show (like 5

With pagination?

I wouldn't display it as file -> status mapping, but rather status -> list of files

Agree. I think we do this in other commands already like metrics/params show/diff? And other output

With pagination?

I would refrain from that in the default case. By default, I think it's better to "show the status of the dir, but if it's small enough, go for particular files" rather than make the user interact with it.

Agree. I think we do this in other commands already like metrics/params show/diff? And other output

Yes, we do, I was just referring to @MikkelAntonsen's example.

What is the progress of this feature request? :smiley:

@lefos99 Not implemented yet :( To implement it, we'll need to change https://github.com/iterative/dvc/blob/79e8e4edafb8a4745200f0a62f49c29ef074bdea/dvc/output/base.py#L204 to compare self.dir_cache with self.get_checksum().dir_info (both are just lists of simple dicts) and provide a granular status instead of the current generic one.
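
A minimal sketch of what such a comparison could look like, assuming each entry in the two lists is a dict with "relpath" and "md5" keys (that shape is an assumption here, not something confirmed in this thread). The function name `diff_dir_entries` is hypothetical and is not part of dvc.

```python
def diff_dir_entries(old_entries, new_entries):
    """Return {relpath: status} for files that differ between two listings.

    Assumes each entry looks like {"relpath": ..., "md5": ...}; this is an
    illustrative shape, not a confirmed dvc data structure.
    """
    old = {entry["relpath"]: entry["md5"] for entry in old_entries}
    new = {entry["relpath"]: entry["md5"] for entry in new_entries}

    status = {}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            status[path] = "new"
        elif path not in new:
            status[path] = "deleted"
        elif old[path] != new[path]:
            status[path] = "modified"
        # Unchanged files are left out of the result.
    return status


# Stand-ins for the cached listing and the freshly computed one.
cached = [{"relpath": "bar", "md5": "a1"}, {"relpath": "baz", "md5": "b2"}]
current = [{"relpath": "baz", "md5": "c3"}, {"relpath": "foo", "md5": "d4"}]
print(diff_dir_entries(cached, current))
# {'bar': 'deleted', 'baz': 'modified', 'foo': 'new'}
```

Building two dicts keyed by relative path keeps the comparison linear in the number of files, which matters for the 1M-files case mentioned earlier in the thread.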

@efiop Thanks for the update! :smiley:
