Context: https://discordapp.com/channels/485586884165107732/485596304961962003/677178126916124702
The issue:
User created data dir which contained files of the same name, some were uppercase, some were lowercase.
Other user cloned the repository on MacOs and could not get images containing uppercase letters,
when those names ovelapped with other images having lowercase (the same) name.
Pulling on linux was fine.
Reproduction script:
#!/bin/bash
rm -rf repo storage git_repo erepo
mkdir erepo git_repo storage
MAIN=$(pwd)
pushd git_repo
git init --bare --quiet
popd
set -ex
pushd erepo
git init --quiet
dvc init -q
mkdir data_dir
echo data1>>data_dir/file
echo data2>>data_dir/File
dvc add data_dir
dvc remote add -d str $MAIN/storage
git add -A
git commit -am "initial"
git remote add origin $MAIN/git_repo
dvc push
git push origin master
popd
git clone git_repo repo
pushd repo
dvc pull
after pull, we should have file
and File
in target repo, which seems to not happen on MacOs.
So, turns out that by default, Windows and MacOs do not support case sensitive file names.
Some resources:
https://superuser.com/questions/165975/are-all-versions-of-windows-case-insensitive
https://apple.stackexchange.com/questions/22297/is-bash-in-osx-case-insensitive
It seems that this filesystem setting, so one should be able to configure case sensitive support.
It doesn't seem to me that we can do something about it at dvc level.
We could probably warn the user about this behavior, though the detection of such use cases could be resource consuming.
When such a situation might happen:
pull
repo created on LinuxSome research shows that we are not the only one affected by this.
Git users on mac might also be hit with this:
https://stackoverflow.com/questions/8904327/case-sensitivity-in-git
if there is no good way to solve it we should at least document it. Probably a separate article about Ensuring cross-platform compatibility
would be nice, which then could also be used linked in the troubleshooting guide.
But need to look into the git first, @pared the link that you've shared seems to note that it now forks fine in git. Might be missing something.
Yes, it seems so, though situation described there is about moving the names, and not coexisting same (from case insensitive perspective) names. I will check how git behaves when we check out repository with "same" filenames.
Ok, so after discussion with @efiop I write down some summary of what happend to better pinpoint the problem.
dir
could be shown as:├── file
└── FILE
dir
. What he received on output, wasCould not create '{md5}.dir.unpacked'
(originating from here)The problem was that the number of data files did not match (for MacOs with default setting file
and FILE
is the same, so checkout
was unintentionally overriding one with another).
Why did unpacked dir creation fail? Our linking method fails if link exists, and from point of view of filesystem it did (check whether FILE
exists was successfull for file
that was created earlier). But unpacked
creation failure does not cripple dvc
critically so we just log a warning.
We could improve the detection of such cases if we would include filenames in the Exception that is thrown upon linking. In this case, the warning could hint us but was too generic, we should probably prepare our own error including both filenames, and print it upon unpacked
creation failure, even if its only warning.
Btw, in case of dvc add file
+ dvc add FILE
+ dvc checkout
, the issue won't be reproduced, because our checkout logic will see that file
exists and will remove it before checking out FILE
. Just a side note for myself.
@efiop, but in case of a case-insensitive system, you cannot create file
and FILE
.
@pared Did you close it intentionally?
I don't recall closing this, don't know what happend, sorry.
Ok, so it seems that git is handling this problem by warning the user about collision.
Here is sample output:
Cloning into 'test_repo'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:
'FILE'
'file'
I guess it would be reasonable for us to do the same, though we should think on how to do that to not slow down any checkout because simply checking all filenames on checkout is O(n^2)
@pared Amazing research! Thank you so much! :pray: As to a reasonable solution, I suppose that git detects that collision while performing the checkout, right? From the wording, it seems like the ordering is completely random, so git doesn't care about it too much after all. We could do the same thing in dvc, for example, by playing around that defensive if self.exists()
in _do_link
, that has tipped us about this issue in the first place. Though I wonder if os.listdir
returns case-sensitive filenames :slightly_smiling_face: , if it does then the implementation might be even simpler.
@efiop
Though I wonder if os.listdir returns case-sensitive filenames, if it does then the implementation might be even simpler.
I suppose it depends on the system. As of current state of research, default settings of Windows and MacOs make them "Case insensitive, but case aware", so
os.listdir
will return "cased" paths but will treat FILE
as if it was file
.
I believe that implementing this logic in _do_link
is the best way to go now because we do the check there anyway, so improving the message sounds like a zero-impact solution. What worries me here is that in the original issue, I did not see the message Link '{}' already exists.
I will investigate why it happened.
Some further research:
this test:
def test_file(tmp_dir, dvc):
(tmp_dir / "data").mkdir()
(tmp_dir / "data"/ "DATA").write_text("asdfsdf")
DATA = os.path.join("data", "DATA")
data = os.path.join("data", "data")
from dvc.path_info import PathInfo
assert dvc.cache.local.exists(PathInfo(DATA))
assert not dvc.cache.local.exists(PathInfo(data))
passes on my Windows computer.
That means that even though filesystem considers data\data
and data\DATA
the same filename when writing,
os.path.lexists
does not.
[EDIT]
It gets better:
this test does not pass not only for MacOs on travis, but also Windows. I guess that system configuration plays a part here.
So, it turns out that git is checking if files collide by accessing the inode of given file. Collision is detected, when the inode has been already taken by another file.
EDIT:
Here is where the check happens:
EDIT 2:
It seems to me that I was wrong, and deciphered logic wrongly, actually what git does, here: when checking out some cache entry, git is comparing its name with all other entries. The comparison is using _stricmp
which is case insensitive.
EDIT 3:
In case of macOs inode check actually takes place
I think we could proceed with this one by implementing our own exists
basing on access to inode
from System
. In the case of linux systems, it still boils down to calling lstat
(that's what os.path.lexists does
). In the case of NTFS we would change our approach to relying on getdirinfo
. I need to still confirm whether this approach will be a viable solution for MacOs.
EDIT:
My latest build from #3411 seem to confirm my supposition.
My proposed solution on how to improve detection of collision when linking:
change the implementation of remote/local.exists
to call dvc.system.System.inode
instead of os.path.lexists
. In the case of FileNotFoundError
we can return False. In every other case, we return True.
@efiop Do you think I missed something?
@efiop Do you think I missed something?
Dunno, you are the one researching :slightly_smiling_face:
In the case of linux systems, it still boils down to calling lstat (that's what os.path.lexists does).
So current exists
calls lstat
and now we need to check inode instead, which will need to call lstat
. I guess I'm confused what's changing there. Or do you mean that it works fine on *nix but on windows cpython's lstat
implementation is not sufficient?
One more summary to grasp current state of this issue:
case sensitive system: system that is able to:
lowercase
path, eg file
and FILE
case insensitive system:
dvc version
output:DVC version: 0.87.0+f9bebd.mod
Python version: 3.7.1
Platform: Windows-10-10.0.18362-SP0
Binary: False
Package: None
Filesystem type (workspace): ('NTFS', 'C:\\')
In some cases, macOS is also considered case insensitive, because usually its filesystem is configured to be case insensitive
.
Worth noting that those systems usually allow you to create FILE
, because they are case preserving
, but they will override existing file
/FiLe
or any other combination, because, from the system point of view, they have the same name.
As of today (dvc
version 0.87.0
), we are unable to handle properly checkout
of following data structure:
├── file
└── FILE
This situation happens when we create our data folder on case sensitive system and try to work with the project on case insensitive system.
checkout
in such situation?dvc
will log warning that it was unable to create unpacked
directory. This will happen because Windows will be unable to create hardlink for one of the files because from system point of view it will already exist.dvc
will still checkout
data
dir but it will contain only one of (file
, FILE
).Make the user aware that the situation with overlapping paths occurred.
_do_link
whether path already exists, why there was no error when user was checking out? Because lexists
that is used to check whether path exists is case sensitiveFirst potential solution (credit to @efiop) would be to detect such overlapping the same way as git does and warn the user about the problem.
Git is doing it by comparing checked out entries with all other existing entries. Comparison, in case of windows is done by case insensitive method. This check happens only on clone. In case of MacOs, the git will obtain stats for given path and make check with inode.
I wrote simple test checking whether same lowercase
names will have same inode. It seems to be working on MacOs and Windows on Travis, but, for example on my windows computer this test does not pass (createFileW
cannot find the target file). So trying to get inode cannot be relied on in all Windows systems.
It seems that what might work here would be to do as git does: check inode in case of macOs and on windows, when linking, we could check directory content and compare lowercase filenames to see if collision occurs.
The problem with such an approach is that if we will do it in _do_link
we might end up slowing checkout execution time for windows.
@pared thank you for such a deep investigation!
This is an important problem and it would be nice if we can fix it but I’d like to understand the priority of the issue.
When we discuss pros/cons of this solution we should keep in mind another option - keep this responsibility on users shoulders. In many cases, this problem has an easy workaround for users (rename some files). The same happens in many multi-platform programming frameworks (even plain C) - you need to follow some conventions to make a framework work in different platforms (and filename is on of the conventions).
For now, the priority does not look like p0. Can we make it p1 or p2?
Please let me know if I missed something about the priority.
@dmpetrov p0 was the research, as we really needed to fully understand what is going on. The solution is lower in priority.