DVC: CRLF and LF line endings cannot be modified after dvc add

Created on 2 Oct 2020 · 11 comments · Source: iterative/dvc

DVC CRLF issue

I've prepared a sample repository with the steps to reproduce this issue.
https://github.com/MetalBlueberry/dvc-crlf

```
DVC version: 1.7.2 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-5.3.0-28-generic-x86_64-with-Ubuntu-18.04-bionic
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p5
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git
```

Cache is committed to the repo so it can be viewed

Reproduce

  • init git and dvc
  • create a script.sh with CRLF line endings
  • dvc add script.sh
  • change CRLF ending to LF
  • try to add it again (no changes are detected)

Obtained

script.sh is stored with CRLF in the cache, so when you run dvc checkout it always comes back with CRLF line endings, even if you modify it locally.
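To illustrate, here is a minimal sketch assuming the dos2unix-style normalization that is confirmed further down in this thread (`normalized_md5` is an illustrative stand-in, not a dvc function):

```python
import hashlib

# Hypothetical illustration, not DVC's actual code: if the hash is
# computed over dos2unix-normalized content, the CRLF and LF variants
# of the same script collapse to the same md5, so a second `dvc add`
# sees nothing new to cache.
crlf = b"echo hello\r\necho world\r\n"
lf = b"echo hello\necho world\n"

def normalized_md5(data: bytes) -> str:
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()

print(hashlib.md5(crlf).hexdigest() == hashlib.md5(lf).hexdigest())  # False: the raw bytes differ
print(normalized_md5(crlf) == normalized_md5(lf))                    # True: identical after normalization
```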

Expected

script.sh should keep its original CRLF line endings, or at the very least the cache content should always be stored with LF endings.

Consequences

  • line endings (CRLF vs. LF) cannot be changed after the first commit
  • the md5 used as the cache file name doesn't match the md5 of the actual file content

Thank you

First reported in https://github.com/iterative/dvc.org/issues/1469#issuecomment-702129969

awaiting response

All 11 comments

Hi @MetalBlueberry !

This is a known behaviour of dvc (https://github.com/iterative/dvc/issues/992) that we anticipated might bite us in the rear at some point. It was initially implemented that way to make sure that Windows and *nix hashes of git-tracked files match, since git automatically converts files to CRLF on Windows unless you set autocrlf to false. In hindsight, we could probably check for that option automatically, or at least document it and just stick with plain md5sum instead of dos2unix + md5sum.

That being said, dvc itself doesn't change the contents of the files it stores, so the collision happened when one file was pushed to the remote.

Could you elaborate on your scenario with CRLF, btw? Just want to have a better idea of your use case.

In my scenario, we were trying to upload an XML file to a dvc remote using a custom uploader. The problem is that the md5 checksum calculated by dvc doesn't match the standard md5 when the file contains CRLF line endings. Our custom uploader validates md5 checksums to ensure that you cannot upload data whose checksum differs from the one declared.

From my point of view, dvc should not change the file content when calculating the md5, so that the hash matches the actual file. In the scenario that you describe, either the Linux user or the Windows user will be forced to work with the other's CRLF/LF line endings. The worst part is that two files that are identical except for CRLF/LF line endings cannot both be uploaded to the same remote, because they end up with the same hash. As I describe in the example, this makes it impossible to change the line endings once you push to the remote.

If Windows users have problems with git converting files to CRLF line endings, they could just change the git setting (autocrlf) instead of DVC keeping inconsistent file hashes.

Another idea could be to enable/disable this behaviour with a parameter. We could keep it enabled by default for backward compatibility, but allow users to disable it for consistency.

@MetalBlueberry That sounds good. And in 2.0 we could consider switching to that behaviour by default (we actually had some plans to do that in 1.0, but didn't get to it). Here is the piece of code that does it: https://github.com/iterative/dvc/blob/d0e1c887926e28e40d677e9b1a9421796b117b53/dvc/utils/__init__.py#L38 .
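Roughly, that dos2unix + md5 path behaves like the following simplified sketch (for illustration only, not the actual implementation, which streams the file in chunks and uses dvc's own istextfile() heuristic):

```python
import hashlib

def file_md5_sketch(path: str) -> str:
    """Simplified sketch of a dos2unix-then-md5 hash. Not DVC's actual
    code: the real helper streams the file in chunks and uses its own
    istextfile() heuristic."""
    with open(path, "rb") as fobj:
        data = fobj.read()
    if b"\x00" not in data[:512]:  # crude "looks like text?" check, for illustration only
        data = data.replace(b"\r\n", b"\n")  # dos2unix: CRLF -> LF before hashing
    return hashlib.md5(data).hexdigest()
```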

Yep, agreed! Providing an option is a great idea. Should we repurpose this ticket for this?

This bit me too. I have a bunch of binary files that dvc thinks are text files even though they aren't. So the md5 hash calculated by dvc doesn't match the one calculated by azure or any other tool that just hashes the bytes without line-ending conversion.

It seems like this kind of approach makes your md5 hashes really brittle. Any bug fix to istextfile or file_md5 might break users' caches.

I would _strongly_ suggest that the stored md5 hashes come with a version number that indicates which version of the algorithm produced them. That way, if the algorithm is updated and improved, the old hashes aren't suddenly useless.

E.g., currently a .dvc file looks like:

```yaml
outs:
- md5: cbc3c863de715aaa1c83b4c73a4baa20
  path: 'my_amazing_binary_file_dvc_thinks_is_txt.bin'
```

If dvc is fixed in 2.0 so that this is properly recognized as binary, the md5 hash above will be wrong.

Instead, consider:

```yaml
outs:
- md5: cbc3c863de715aaa1c83b4c73a4baa20
  md5_hash_version: 1.0
  path: 'my_amazing_binary_file_dvc_thinks_is_txt.bin'
```

If the hash algorithm changes, just create a "1.1" or "2.0" version, but keep the ability to interpret a 1.0 hash for older .dvc files.
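For illustration, a reader of such a versioned .dvc entry could dispatch on the declared version. This is a hypothetical sketch; neither the md5_hash_version field nor these helpers exist in dvc today:

```python
import hashlib

# Hypothetical sketch only. "1.0" models the current dos2unix-then-md5
# behaviour for text files, "2.0" a plain md5 over the raw bytes.
def md5_v1(data: bytes) -> str:
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()

def md5_v2(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

HASHERS = {"1.0": md5_v1, "2.0": md5_v2}

def verify_out(data: bytes, out: dict) -> bool:
    """Check a .dvc 'outs' entry against file data, dispatching on the
    declared hash version (defaulting to the legacy behaviour)."""
    hasher = HASHERS[str(out.get("md5_hash_version", "1.0"))]
    return hasher(data) == out["md5"]
```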

Wow this kind of bug makes me nervous.

@gamis Thanks for the feedback! We plan on changing the cache format in 2.0 (https://github.com/iterative/dvc/issues/4841) and plan on dropping dos2unix at the same time, so there is no cache mixup. Once we turn off dos2unix by default, there is not much reason to introduce md5 versioning, since it will be a true md5 anyway. That said, we are also considering switching to a different default hash type, because people tend to be scared of md5, even though we are not doing any cryptography with it.

Regarding the azure etag: I don't think even azure guarantees it to be an md5; it just happens to match sometimes. Same with s3's etag. So you probably shouldn't rely on it too much. Or am I missing something?

Thanks for the quick response:

  1. Regarding versioning, if I upgrade to dvc 2.0, what happens to my .dvc files and my cache? Are they going to suddenly be completely invalid?
  2. Regarding azure, I'm not referring to the etag but rather the content_md5 property. It is only filled if the uploading client includes it, which is true of both azcopy and the python client.

Regarding versioning, if I upgrade to dvc 2.0, what happens to my .dvc files and my cache? Are they going to suddenly be completely invalid?

No, 2.0 will likely be backward compatible with old dvc files and cache, at the very least in read-only mode. If something needs to be broken, we'll either migrate automatically or provide migration instructions/a script. Keep in mind that we still need to be able to read old metrics/plots/params and do some other stuff with the project's history, so we kinda have to have at least read-only backward compatibility. :slightly_smiling_face:

Regarding azure, I'm not referring to the etag but rather the content_md5 property. It is only filled if the uploading client includes it, which is true of both azcopy and the python client.

Ah, makes sense :+1:

@gamis

Regarding azure, I'm not referring to the etag but rather the content_md5 property. It is only filled if the uploading client includes it, which is true of both azcopy and the python client.

Could you please clarify what the purpose of this field is (if it's a user-provided one) and how people use it?

In my case, I'm using gocloud/blob, which is a cross-provider blob storage library.

I also have the Content-MD5 field, and it is used to check the integrity of the uploaded file: if the calculated hash doesn't match the user-provided hash, the upload fails. https://godoc.org/gocloud.dev/blob#WriterOptions

It is also a standard header used to check the integrity of data in transit: https://tools.ietf.org/html/rfc1864

For my specific use case, I'm using the value to upload files to a dvc repository through an API. The hash is used not only for integrity validation but also to open a connection with the storage site at the beginning of the upload so the file contents can be streamed. The md5 is required up front because it determines the file's location in the remote's cache structure.
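For reference, the Content-MD5 value from RFC 1864 is just the base64-encoded md5 digest of the raw bytes, which is why a line-ending-normalized hash can't be reused for it. A minimal sketch:

```python
import base64
import hashlib

def content_md5(data: bytes) -> str:
    """RFC 1864 Content-MD5: base64 of the md5 digest over the raw,
    untouched bytes, with no line-ending normalization involved."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

# A CRLF file and its LF twin produce different Content-MD5 values,
# so a dos2unix-normalized hash cannot stand in for this header.
print(content_md5(b"echo hello\r\n"))
print(content_md5(b"echo hello\n"))
```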

@efiop

we are also considering switching to a different default hash type, because people tend to be scared of md5, even though we are not doing any cryptography with it.

I think that "people tend to be scared" is not a good reason to leave md5 behind. It won't be easy to replace it with another algorithm while keeping backward compatibility.

Yes, what @MetalBlueberry said! Azure's azcopy will declare an upload failed if the md5 doesn't match. I use it to ensure that my local versions of the files are identical to those in Azure.

+1 re people being "scared" of md5. Maybe add a paragraph in the docs explaining why it's OK. There's not really any good reason to switch, though I do like blake2s!
