So in case of our current test, if we commit modified
alicewe will getData and pipelines up to dateondvc status
and we will getnew: aliceondvc status -c. As this issue is about filtering thestatusI would keep output ofdvc status -cas is and create issue discussing whether we should have consistent output format for bothdvc statusanddvc status -c.
_Originally posted by @pared in https://github.com/iterative/dvc/pull/4433#discussion_r477095154_
@skshetry we can discuss the output and format and the Stage::status() with filter_info here
So in case of this issue there are actually 2 things to be done:
_joint_status so that it returns same structure, no matter whether we provide filter_info or notdvc status and dvc status -cI think 1. has been discussed under #4433 and we should always return the stage.status with or without flter_info. In case of unchanged files we probably don't have to mention that. Maybe in case of exact output match we could print "unchanged" but in other cases I think its better to report only "edited" statuses.
As to 2:
Lets consider following test:
def test_status(tmp_dir, scm, dvc, local_remote):
tmp_dir.dvc_gen({"dir": {"file1": "file content", "file2": "file2 content"}}, commit="init")
dvc.push()
tmp_dir.gen({"dir": {"file1": "modified file content"}})
s_local = dvc.status()
dvc.commit("dir.dvc", force=True)
s_cloud = dvc.status(cloud=True)
assert s_local == {"dir.dvc": [{'changed outs': {'dir': 'modified'}}]}
assert s_cloud == {'dir': 'new', 'dir/file1': 'new'}
Considering that both status and status --cloud is the same command, I think it would make sense that both would return unified status. On the other hand, docs specifically mention that in the case of --remote and --cloud we compare cache's content. And status from the cache point of view is something different than status from workspace point of view. In the case of the repo, we identify outputs by its path, so it's easy to detect whether something changed (if the file under the given path has different md5, it changed). But cache has no notion of outputs, it only has paths and files under it. So, in order to unify status output we would need to implement some logic making connection outputs -> remote so that we can detect that dir/file1 is modified. That would probably mean state for remotes, which would lead to performance degradation. I don't think it is worth it.
@karajan1001 What do you think?
@pared @skshetry
To question 1, first we should move the filter_info logic into Stage()::status. Currently, all the logics associated with it are all in the base level (stage, output, cache). And at the API level like checkout, we should only pass it as arguments to the lower level without actually handle it.

And the question is what info should we return for stage.status with a not none filter_info. And this depends on what we should show in DVC status [outputs], or should we follow the output format of DVC status -c [outputs]. Now I feel the question is that status -c is a totally different command from status. One compares the workspace to the HEAD snapshot with file path info, the other compares the cache with remote cache, no file path info needed. And yes they have a totally different logic in
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L31-L38
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L81-L109
,but shared with the same interface status. Now DVC status -c [outputs] shares the same output format with DVC status [stages], and DVC status [outputs] should share the same output format with DVC status [stages] not DVC status -c [outputs]
➜ tests git:(master) ✗ dvc status -c s1
new: alice
new: bob
➜ tests git:(master) ✗ dvc status -c alice
new: alice
Now DVC status -c [outputs] shares the same output format with DVC status [stages], and DVC status [outputs] should share the same output format with DVC status [stages] not DVC status -c [outputs]
@karajan1001
I think this is a great summary of the whole issue.
Thank you @karajan1001 for handling this!
Most helpful comment
@pared @skshetry

To question 1, first we should move the
filter_infologic intoStage()::status. Currently, all the logics associated with it are all in the base level (stage,output,cache). And at the API level likecheckout, we should only pass it as arguments to the lower level without actually handle it.And the question is what info should we return for
stage.statuswith a not nonefilter_info. And this depends on what we should show inDVC status [outputs], or should we follow the output format ofDVC status -c [outputs]. Now I feel the question is thatstatus -cis a totally different command fromstatus. One compares the workspace to the HEAD snapshot with file path info, the other compares the cache with remote cache, no file path info needed. And yes they have a totally different logic inhttps://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L31-L38
https://github.com/iterative/dvc/blob/334556f07dc511927543218d4a2a1a1c1c83ed65/dvc/repo/status.py#L81-L109
,but shared with the same interface
status. NowDVC status -c [outputs]shares the same output format withDVC status [stages], andDVC status [outputs]should share the same output format withDVC status [stages]notDVC status -c [outputs]