Dvc: Discuss whether we should have consistent output format for both `dvc status` and `dvc status -c`.

Created on 27 Aug 2020  Â·  4Comments  Â·  Source: iterative/dvc

So in case of our current test, if we commit modified alice we will get Data and pipelines up to date on dvc status
and we will get new: alice on dvc status -c. As this issue is about filtering the status I would keep output of dvc status -c as is and create issue discussing whether we should have consistent output format for both dvc status and dvc status -c.

_Originally posted by @pared in https://github.com/iterative/dvc/pull/4433#discussion_r477095154_

@skshetry we can discuss the output and format and the Stage::status() with filter_info here

discussion

Most helpful comment

@pared @skshetry
To question 1, first we should move the filter_info logic into Stage()::status. Currently, all the logics associated with it are all in the base level (stage, output, cache). And at the API level like checkout, we should only pass it as arguments to the lower level without actually handle it.
Screenshot from 2020-08-27 22-20-08

And the question is what info should we return for stage.status with a not none filter_info. And this depends on what we should show in DVC status [outputs], or should we follow the output format of DVC status -c [outputs]. Now I feel the question is that status -c is a totally different command from status. One compares the workspace to the HEAD snapshot with file path info, the other compares the cache with remote cache, no file path info needed. And yes they have a totally different logic in
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L31-L38
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L81-L109

,but shared with the same interface status. Now DVC status -c [outputs] shares the same output format with DVC status [stages], and DVC status [outputs] should share the same output format with DVC status [stages] not DVC status -c [outputs]

➜  tests git:(master) ✗ dvc status -c s1
        new:                alice                                                                                                                                               
    new:                bob
➜  tests git:(master) ✗ dvc status -c alice
        new:                alice    

All 4 comments

So in case of this issue there are actually 2 things to be done:

  1. unify the output of _joint_status so that it returns same structure, no matter whether we provide filter_info or not
  2. discuss whether we should unify output between dvc status and dvc status -c

I think 1. has been discussed under #4433 and we should always return the stage.status with or without flter_info. In case of unchanged files we probably don't have to mention that. Maybe in case of exact output match we could print "unchanged" but in other cases I think its better to report only "edited" statuses.

As to 2:
Lets consider following test:

def test_status(tmp_dir, scm, dvc, local_remote):
    tmp_dir.dvc_gen({"dir": {"file1": "file content", "file2": "file2 content"}}, commit="init")
    dvc.push()

    tmp_dir.gen({"dir": {"file1": "modified file content"}})
    s_local = dvc.status()
    dvc.commit("dir.dvc", force=True)
    s_cloud = dvc.status(cloud=True)

    assert s_local == {"dir.dvc": [{'changed outs': {'dir': 'modified'}}]}
    assert s_cloud == {'dir': 'new', 'dir/file1': 'new'}

Considering that both status and status --cloud is the same command, I think it would make sense that both would return unified status. On the other hand, docs specifically mention that in the case of --remote and --cloud we compare cache's content. And status from the cache point of view is something different than status from workspace point of view. In the case of the repo, we identify outputs by its path, so it's easy to detect whether something changed (if the file under the given path has different md5, it changed). But cache has no notion of outputs, it only has paths and files under it. So, in order to unify status output we would need to implement some logic making connection outputs -> remote so that we can detect that dir/file1 is modified. That would probably mean state for remotes, which would lead to performance degradation. I don't think it is worth it.
@karajan1001 What do you think?

@pared @skshetry
To question 1, first we should move the filter_info logic into Stage()::status. Currently, all the logics associated with it are all in the base level (stage, output, cache). And at the API level like checkout, we should only pass it as arguments to the lower level without actually handle it.
Screenshot from 2020-08-27 22-20-08

And the question is what info should we return for stage.status with a not none filter_info. And this depends on what we should show in DVC status [outputs], or should we follow the output format of DVC status -c [outputs]. Now I feel the question is that status -c is a totally different command from status. One compares the workspace to the HEAD snapshot with file path info, the other compares the cache with remote cache, no file path info needed. And yes they have a totally different logic in
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L31-L38
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L81-L109

,but shared with the same interface status. Now DVC status -c [outputs] shares the same output format with DVC status [stages], and DVC status [outputs] should share the same output format with DVC status [stages] not DVC status -c [outputs]

➜  tests git:(master) ✗ dvc status -c s1
        new:                alice                                                                                                                                               
    new:                bob
➜  tests git:(master) ✗ dvc status -c alice
        new:                alice    

Now DVC status -c [outputs] shares the same output format with DVC status [stages], and DVC status [outputs] should share the same output format with DVC status [stages] not DVC status -c [outputs]

@karajan1001
I think this is a great summary of the whole issue.

Thank you @karajan1001 for handling this!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

siddygups picture siddygups  Â·  3Comments

dnabanita7 picture dnabanita7  Â·  3Comments

prihoda picture prihoda  Â·  3Comments

nik123 picture nik123  Â·  3Comments

TezRomacH picture TezRomacH  Â·  3Comments