Context:
$ dvc -V
0.77.3+e03136
# all txt files are data files
$ ls
newfile.txt small.txt ttt.txt
newfile.txt.dvc small.txt.dvc ttt.txt.dvc
$ echo qwert >> small.txt
$ dvc add small.txt
$ git add small.txt.dvc
$ git commit -m hello
$ dvc diff HEAD^
dvc diff from ada9d97 to 1a22314 . # <<-- "diff ada9d97..1a22314" might be enough
diff for 'small.txt'
-small.txt with md5 5f73f210b6202aa07279cdff6776b64f
+small.txt with md5 b4301ef665d0343e028fcd5821654edc
added file with size 6 Bytes . # <<-- File was NOT added, it was modified. 6 bytes is diff, not size.
diff for 'ttt.txt'
-ttt.txt with md5 92a47ab38589702d290647ce24386181
+ttt.txt with md5 92a47ab38589702d290647ce24386181
file size was not changed . # <<-- what is a reason to output this then? Also, is it about the previous diff or the next one - hard to read?
diff for 'newfile.txt'
-newfile.txt with md5 99370a5386cac00c93c9fdc836076a7d
+newfile.txt with md5 99370a5386cac00c93c9fdc836076a7d
file size was not changed # <<-- what is a reason to output it then?
Also, it would be great to simplify this output for parsing.
Possible "clean" output:
$ dvc diff HEAD^
diff ada9d97..1a22314
Changed 'small.txt' +6 bytes
md5 5f73f210b6202aa07279cdff6776b64f b4301ef665d0343e028fcd5821654edc
New 'fakefile.txt' +13728494 bytes
md5 92a47ab38589702d290647ce24386181 99370a5386cac00c93c9fdc836076a7d
Removed 'fakefile_2222.txt' +8326436 bytes
md5 e0313660cfc07a10417543b8e4d08bea9 f4daf3bacc9a9494df7d641f572e92b66
Actions:
--porcelain
EDIT 12/21/19:
The same requirements should apply to directories:
$ dvc diff -t dir1 HEAD^
dvc diff from 025a8b3 to 41d8e27
diff for 'dir1'
-dir1 with md5 7492f47e0a1908a80d942a01702e6b8b.dir
+dir1 with md5 8f8824fbd7b1107ef69c85ab5db82973.dir
4 files untouched, 0 files modified, 1 file added, 0 files deleted, size was increased by 37 Bytes
Actions:
--porcelain
(like in Git) or --to-json
to make output format easy-to-parse for scripts.EDIT 12/24/19:
Another request just came https://github.com/iterative/dvc/issues/770#issuecomment-568642873 The request is related to data dirs - dvc diff
does not go deeper in data dirs. This can be considered as a part or an enhancement of this issue.
Also, it seems like data file checksums should not be shown by default. Let's prioritize human readability like in Git. An option can add checksums.
The output could look like:
$ dvc diff HEAD^
diff ada9d97..1a22314
New: 'fakefile.csv' 751 Mb
New: 'fakefile.txt' 2 Kb
Removed: 'fakefile_2222.txt' 536Kb
Modified: 'small.txt' 36Kb (+6 bytes)
Modified: 'hello.txt' 351 Mb (+12 Mb)
Modified: 'hello.csv' 631 Mb (-3.4 Mb)
Modified dir: 'images'. 116 files untouched, 2 modified, 1 added, 1 deleted.
New: 'DM3SZ.jpg` 2.1 Mb
Modified: 'B8PZM.jpg' 2.5 Mb (+158 Kb)
Modified: 'T79JK.jpg' 1.7 Mb (-0.2 Mb)
Removed: '4USKW.jpg' 2.6 Mb
$ dvc diff HEAD^ diff ada9d97..1a22314 Changed 'small.txt' +6 bytes md5 5f73f210b6202aa07279cdff6776b64f b4301ef665d0343e028fcd5821654edc ...
How about just using the first 7 digits, similar t a short SHA?
md5 5f73f21 b4301ef
or the rel. path in cache
path 5f/73f210b6202aa07279cdff6776b64f b4/301ef665d0343e028fcd5821654edc
or nothing (unless a special flag is used), like in the latest example:
$ dvc diff HEAD^ diff ada9d97..1a22314 New: 'fakefile.csv' 751 Mb Removed: 'fakefile_2222.txt' 536Kb Modified: 'small.txt' 36Kb (+6 bytes) Modified dir: 'images'. 116 files untouched, 1 modified, 1 added, 1 deleted. New: 'DM3SZ.jpg` 2.1 Mb Modified: 'B8PZM.jpg' 2.5 Mb (+158 Kb) Removed: '4USKW.jpg' 2.6 Mb
Other questions about ^ that example:
--recursive
/-R
option with default value of 1 to avoid crazy long diffs? In that case only "files untouched, modified, added, deleted" would be printed for dirs in the last level.--recursive=-1
to only print the summary, keeping the output super short.)-h
human readable file size units (e.g. with humanize.naturalsize(bytes, gnu=True)
from this pak.$ dvc diff -R 2 HEAD^
diff ada9d97..1a22314
0 files/dirs untouched, 2 modified, 1 added, 1 deleted.
New: 'fakefile.csv' 751 M
Removed: 'fakefile_2222.txt' 536 K
Modified: 'small.txt' 36K (+6 bytes)
Modified dir: 'images'. 116 files/dirs untouched, 7 modified, 1 added, 1 deleted.
New: 'DM3SZ.jpg` 2.1 M
Modified: 'B8PZM.jpg' 2.5 M (+158 K)
Removed: '4USKW.jpg' 2.6 M
Modified dir: 'faces'. 0 files/dirs untouched, 1 modified, 1 added, 0 deleted.
New: 'pokerface.jpg` 3.7 M
Modified dir: 'animals'. 0 files untouched, 0 modified, 100 added, 0 deleted.
Modified dir: 'faces'. 3 dirs untouched, 1 modified, 1 added, 1 deleted.
Modified dir: ...
...
$ dvc diff -R -1 HEAD^
diff ada9d97..1a22314
0 files or directories untouched, 2 modified, 1 added, 1 deleted.
Also, do we plan to recognize when "1 added, 1 deleted" means that a file was moved, and count it as a single change instead? Like Git does some times. This is probably overkill.
@jorgeorpinel thank you for the questions - all are great points!
- Should there be new lines between New, Modified, and Removed inside directories (indented levels) or vice versa, no new lines in he root level? For consistency
Well... possibly. But instead, I'd remove all the new lines for consistency. See git status
might be an example.
- Should there be a
--recursive
/-R
option with default value of 1 to avoid crazy long diffs? In that case only "files untouched, modified, added, deleted" would be printed for dirs in the l ast level.
-R 1
does not look natural. It would be great to see examples from other diff-like commands.
Thank you for bringing this. The problem that I see now - all these spaces as prefixes won't look good at 3rd+ level. We should think about gettig rid of the space-prefixes. It might look like:
$ dvc diff HEAD^
diff ada9d97..1a22314
New: 'fakefile.csv' 751 Mb
Removed: 'fakefile_2222.txt' 536Kb
Modified: 'small.txt' 36Kb (+6 bytes)
Modified dir: 'images'. 116 files untouched, 1 modified, 1 added, 1 deleted.
New: 'images/DM3SZ.jpg` 2.1 Mb
Modified: 'images/B8PZM.jpg' 2.5 Mb (+158 Kb)
Removed: 'images/4USKW.jpg' 2.6 Mb
New: 'images/outliers/NRD8E.jpg` 1.3 Mb
New: 'images/outliers/W84N2.jpg` 2.0 Mb
- Should the root diff also print the "files untouched, modified, added, deleted" summary on top? (With this you could use
--recursive=-1
to only print the summary, keeping the output super short.)
I was thinking about the summary only for data dirs. I don't think it is needed for a regular ones and for dirs inside data dirs (see files in images/outliers/
above - no summary). --recursive=-1
- might be an option - is there any examlpes?
- I would rec. GNU-style
-h
human readable file size units (e.g. withhumanize.naturalsize(bytes, gnu=True)
from this pak.
馃憤 We might show no sizes by default and the numbers with -h
.
--recursive=-1
- might be an option - is there any examlpes?
My example was
$ dvc diff -R -1 HEAD^
diff ada9d97..1a22314
0 files or directories untouched, 2 modified, 1 added, 1 deleted.
But you also said
I was thinking about the summary only for data dirs.
Which makes sense, so not sure we want to provide this after all.
-R 1
does not look natural. It would be great to see examples from other diff-like commands.
git diff
for example can print very, very long diffs when there's lots of changed files, and it does it without indentation like you suggest (with absolute paths). No recursion/level option is available. (It does have some other options to make the output smaller like --minimal
and --name-only
though.)
@jorgeorpinel I got your point and -R
might be a good option. But I was asking about examples in some popular unix tools like diff
, ls
, grep
, git
or similar.
Do you have any examples like this? It's okay if not.
Ah no. I don't think -R '-1'
exists anywhere haha. I guess it's a little strange. We could also just shift it up to 0 and make the default 1. Just some thoughts about the recursive option but these refinements are probably secondary to the main behavior and could have their own separate issue (lower priority).
I had a great chat with @mroutis re this issue. I'm putting down a summary...
(1) Agreed to use this-like format structure:
Added:
firmograph.txt
src/dvc.ico
data/dir1/foo.txt
data/dir1/bar.csv
Modified:
source.csv
data/ <<--- note, this is a data dir and this is why it is in output.
data/lorem.txt
data/ipsum.txt
data/dir1/somefile.txt
data/dir1/file2.txt
Deleted:
dir/ <<--- note, this is a data dir and this is why it is in output.
dir/file1.txt
(2) Json output should reflect this structure with "added", "modified", "deleted" on the top and list of changes. For data directories Json also needs to include "is_dir": True
signal or just /
suffix in name "name": "dir1/"
or both.
(3) File sizes are not needed now. To solve the file sizes problem holistically we need to include file sizes in dvc-files (when we compute checksums) and then reuse it in other commands including diff.
(4) Checksums are not needed by default. But it would be great to have an option like --show-checksum
to include them into the output and json.
(5) It would be great to see a diff between HEAD and the current workspace by default (like in Git).
(6) [We haven't' discussed this but it might be still important] We probably need a header like in the first line of output diff 43168e0..3f7ad04
as well in json "diff" {"from": "43168e07a3f2d5ea86eca8e1b1c926f90229d9fb", "to": "3f7ad041875321fee33dead9b9b6b073b139b943"}
.
@mroutis thanks for the discussion and please comment if you have something to add.
(6) [We haven't' discussed this but it might be still important] We probably need a header like in the first line of output diff 43168e0..3f7ad04 as well in json "diff" {"from": "43168e07a3f2d5ea86eca8e1b1c926f90229d9fb", "to": "3f7ad041875321fee33dead9b9b6b073b139b943"}.
@dmpetrov , what Git does is to show the different checksums from each file, not the trees that you are comparing.
Repeating the arguments at the output seems excessive, since you already know where the diff comes from.
Adding such information to the JSON output is a one-liner. Let's add it just if it is needed :slightly_smiling_face:
note, this is a data dir and this is why it is in output.
Is this a note for us here or an actual part of the output? Not sure I get it either way.
I like this proposed structure and format but this new direction for dvc diff
looks more like git status
than git diff
. Should we be talking about dvc status
in fact?
see a diff between HEAD and the current workspace by default
Yes! But again, kind of like git status
?
p.s. I see this latest format is consistent with https://github.com/iterative/dvc/issues/2995#issue-541398130 too, which is nice. I like consistency 馃檪
Yes! But again, kind of like
git status
?
@jorgeorpinel good point about dvc status
. In Git, status
and diff
do similar things but diff
is focused on file content and versions/checksums (HEAD^^
) while status
is about workspace/staging state. We are not showing the content diff in DVC, so the difference is blurring and there is a chance that we are mixing these commands a bit.
What we need in this functionality is checksum1
vs checksum2
difference. So, it is closer to git diff
from this point of view.
If we decide to incorporate dvc diff
in dvc status
it will require command syntax/params change and we might lose the connection with the Git commands. I'd suggest keeping dvc diff
as a separate command for now and later we can merge them we find a strong reason.
Is this a note for us here or an actual part of the output? Not sure I get it either way.
It is the actual output. We just don't want to lose the information about data/not-data directory that might be important for UI.
What we need in this functionality is checksum1 vs checksum2 difference. So, it is closer to git diff from this point of view.
But actually I believe dvc diff
is currently being reworked to behave like the new dvc metrics diff
, where by default it compares HEAD vs the working tree (local changes) @dmpetrov. (Per #2999? Not sure)
I'd suggest keeping dvc diff as a separate command for now and later we can merge them we find a strong reason.
I agree totally! My suggestion is that maybe we should be reviewing the behavior of dvc status
and not that of dvc diff
here.
It is the actual output.
OK but I don't get the meaning of that note. Does it mean e.g. data/
was not added to DVC as a directory, but it's being shown because files inside were? If you can explain I can suggest another phrasing for these short notes. Cc @mroutis
@jorgeorpinel I've created #3255 for discussing diff vs status and possible merge.
Does it mean e.g.
data/
was not added to DVC as a directory, but it's being shown because files inside were? If you can explain I can suggest another phrasing for these short notes. Cc @mroutis
You are right. If anything was change inside a data dir - the dir will change and we don't want to lose this information (from output).
Also created #3256 re the file sizes.
Thanks!
note, this is a data dir and this is why it is in output
Does it mean e.g. data/ was not added to DVC as a directory, but it's being shown because files inside were?
If anything was change inside a data dir - the dir will change and we don't want to lose this information (from output).
Sorry, I still don't really get it. I don't think the note was added in #3229 anyway (cc @mroutis)
@jorgeorpinel I think note is not part of the output (check the PR https://github.com/iterative/dvc/pull/3229/files - this the fastest way to answer this kind of questions).
The way I understood this:
If data
dir itself is under DVC control and we add/delete files inside it and/or add/delete the directory we need to show _it_ in the output along with changed files inside.
@dmpetrov @mroutis feel free to correct me if I'm wrong :)
OK, I think I finally get it. My rec text for this note is something like "Entire dir tracked by DVC, specific changes listed below."
so, since this not is not part of the output (I hope), we can save this phrase for the docs :)
Yep, @shcheklein , you are right; I tried to explain it here, @jorgeorpinel : https://github.com/iterative/dvc/pull/3229/files#diff-2ba4ed2b1831b35693f1715dfb9c9232R21-R32
If data dir itself is under DVC control and we add/delete files inside it and/or add/delete the directory we need to show it in the output along with changed files inside.
Has this been really addressed by #3229? It appears to me that changes within DVC-tracked directories are still opaque in diff
output.
Hi @mys007 ! Could you share the reproduction script, please?
Sure, please run this in an empty directory:
git init
dvc init
mkdir one
touch one/first
touch one/second
dvc add one
git add one.dvc .gitignore
git commit -m "initial"
rm one/first
dvc add one
git add one.dvc
git commit -m "removed a file"
Then dvc diff HEAD^
prints just a summary without any indication what happened in "one":
Modified:
one/
files summary: 0 added, 0 deleted, 0 modified
The same for dvc status
if called after rm one/first
.
I'm using the current DVC master code.
@mys007 Oh, that's bad :( Reopening.
In addition to this... it seems like missing cache files are not handled properly:
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
$ dvc diff
Deleted:
data/data.xml
data/features/
data/prepared/
model.pkl
files summary: 0 added, 2 deleted, 0 modified
# But there are no deleted files. Just missed caches.
$ dvc pull
$ dvc diff HEAD
# no diff - an expected result
Ideally, DVC should get the difference without cache by using just metafiles. An exception is data dirs (from cache). Data dirs should be pulled automatically during dvc diff
(if there are no special options provided).
Related to this issue - we should store file size in addition to hashes. It will give us an ability to show file sizes in dvc diff
.
Actions for dvc diff
:
Ideally, DVC should get the difference without cache by using just metafiles. An exception is data dirs (from cache). Data dirs should be pulled automatically during
dvc diff
(if there are no special options provided).
Just the clarify this, is the desired behavior here to fetch
and checkout
a data directory if the entire directory is missing from local cache (but only if the entire directory is uncached)?
The reason I'm asking is because if we get into behavior where diff
checkout's files by default, it can become unclear when a file is missing (and should be pulled/checked out) vs when a file is actually deleted.
ex if diff always pulls data dirs by default, assuming we have a dvc-tracked data directory containing [dir/file1, dir/file2]
$ dvc diff
# pulls contents of dir/ by default
# no changes (expected)
$ rm dir/file1
$ dvc diff
# automatic pull would re-checkout dir/file1
# no changes (unexpected)
In this scenario presumably dir/file1
should be considered deleted and not missing.
In the second diff
command, we could assume that dir/
is not supposed to be pulled/checked out since the local cache entries exist at that point (since they were fetched/pulled by the first diff
). But diff
pulling files at all seems a bit unintuitive to me.
@dmpetrov
After discussion with @dmpetrov, we decided that diff
should be aware of an additional "not in cache" file state (like dvc status
). So for the example-get-started scenario where nothing has been pulled yet, we would show that 0 files have been deleted, but several files are not in cache (and still need to be pulled). This state will be displayed by default, but can be hidden by a flag.
Bugs for this issue have been resolved, there is an existing separate issue for the file size related changes: #3256
Hi!
additional "not in cache" file state (like dvc status)
So are the states the same for status and for diff? Is it "not in cache" or is it "missing" (neither in cache nor in remote when -rc are used)?
@jorgeorpinel The state for diff is "not in workspace and not in local cache". We don't check anything in remotes for diff other than trying to fetch dir cache (list of files that are supposed to be in a directory output) from a default remote (and that is only fetched if it is not already present in local cache).
Right! I was confused. I think let's try not to mix the concept of status
states and diff
"state" (wouldn't even call them that for diffs) to avoid these confusions. Won't be easy to explain that way in docs at least, as I commented in your docs PR 馃檪
Most helpful comment
@mys007 Oh, that's bad :( Reopening.