When cloning a repository containing a file larger than 4 GB from a Bitbucket Server repository (LFS enabled), the file is not reconstructed correctly; e.g. a 6 GB file results in a 700 MB file. The lfs/objects folder contains the correct file, however.
Steps to reproduce:
Server: Basic install of Debian with Bitbucket Server 4.6, Git 2.13 (64-bit)
Client 1: Ubuntu 16.04 (64-bit), Git 2.13 and Git LFS 2.2.1 (both 64-bit)
Client 2: Windows Server 2012 (64-bit), Git 2.13 and Git LFS 2.2.1 (both 64-bit)
Client 1 works correctly.
Client 2 pulls down the file correctly into `.git/lfs/objects/aa/6d/aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049` but does not reconstruct/copy it correctly to the destination folder (it results in a 1.82 GB file).
Atlassian have looked into the problem and believe that the Bitbucket server is working correctly, since the correct content is retrieved over the network into the temporary object file (the CRCs match the original file).
Note that no Git configuration has been changed (smudge filters etc. are the default).
If the file is removed and `git lfs pull` is performed, the file is created correctly. Using `git lfs clone` also works.
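For reference, the recovery steps look like this (a sketch; `path/to/bigfile.bin` stands in for the affected file):

```
$ rm path/to/bigfile.bin
$ git lfs pull    # re-creates the file correctly
```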
First, you can confirm that Bitbucket is sending the data correctly by checking the file in `.git/lfs/objects`. If this checks out, the server is doing things correctly:

```
$ cd .git/lfs/objects/aa/6d
$ shasum -a 256 aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049
aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049 aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049
```
Next, the fact that it works with `git lfs clone` and `git lfs pull` means the Git LFS code for copying files works. This ultimately comes down to `lfs.LinkOrCopy()`, which either makes a hard link (if `.git/lfs/objects` is on the same partition as your working dir) or manually writes the bytes to the new location.
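As a quick way to see which path `lfs.LinkOrCopy()` will take, you can compare the filesystems of the object store and the working directory (a sketch; if the `Filesystem` column differs, LFS must copy the bytes instead of hard-linking):

```
$ df .git/lfs/objects .
```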
So, this leaves the Git filters. There are two modes that could be causing problems:

- `process` mode is the default in your reported Git and LFS versions. The `process` mode uses a protocol to receive smudge requests from Git via STDIN, and to deliver object contents from LFS via STDOUT.
- `smudge` mode is basically a request from Git to get the contents for a single file.

It may be worth triggering the smudge filter directly to see if it does the same thing:

```
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://git-server/your/repo
$ cd repo
$ cat path/to/file
version https://git-lfs.github.com/spec/v1
oid sha256:98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
size 3
$ cat path/to/file | git lfs smudge > smudged-file.bin
$ shasum -a 256 smudged-file.bin
98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 smudged-file.bin
```
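For reference, you can see which mode Git will use by listing the LFS filter configuration; when `filter.lfs.process` is set, Git prefers it over the single-shot smudge/clean filters. A default `git lfs install` typically produces:

```
$ git config --get-regexp '^filter\.lfs\.'
filter.lfs.clean git-lfs clean -- %f
filter.lfs.smudge git-lfs smudge -- %f
filter.lfs.process git-lfs filter-process
filter.lfs.required true
```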
Based on that, I think one of the following could be happening:

- Git LFS isn't writing the object contents to `STDOUT` correctly. Maybe there are some special characters in your files that are stripped or something?

You could try debugging with `GIT_TRACE_PACKET=1 git clone` or something, but the output would be _massive_ for your files :)

Some questions:
It'd be really helpful if we could get a sample file that exhibits this behavior. I imagine that's a no-go, so we may have to come up with a special build of LFS with special tracing powers. @ttaylorr, any thoughts? Did I miss any debugging questions or trial commands to run?
@technoweenie that looks pretty comprehensive. My hunch is that it's related to one of the three issues you described as being process filter-related.
@obe1line do you have a copy of the file or repository that you could share? I think that would be the easiest way for me to debug this going forward.
Hey @obe1line, I ran this by a Git core dev, and he mentioned that Git on Windows does not support files over 4GB. Unfortunately, this is not something we can fix in LFS. The best we can do now is add a warning when large objects are added.
As a workaround, I think you should disable smudging completely:
```
$ git lfs install --skip-smudge
$ git lfs env
... snip
git config filter.lfs.process = "git-lfs filter-process --skip"
git config filter.lfs.smudge = "git-lfs smudge --skip -- %f"
```
After that, you'll have to run `git lfs pull` any time you change branches or fetch updates from your remote.
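With that configuration, a typical update cycle looks like this (a sketch; the branch name is hypothetical):

```
$ git fetch origin
$ git checkout feature-branch   # files appear as small pointer files
$ git lfs pull                  # replaces the pointers with the real content
```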
@ttaylorr I think we should add an early warning in the filter smudge and process code, perhaps linking to a page offering this workaround.
@technoweenie Thanks for the detailed comment.
In answer to your questions:
I used `GIT_TRACE_PACKET`, `GIT_TRACE`, and `GIT_CURL_VERBOSE` to produce output previously and yes, it is rather large (~12 GB from memory).
@ttaylorr It can be reproduced with any file >4 GB; nothing special about the repository. The file is pushed with the Linux client and fetched with the Windows client - I assume that if I had used Windows to push, the file might not have transferred fully into the repository.
@technoweenie Is the workaround only applicable to fetch? I.e., would uploading a file via `git lfs push` work correctly even with the Git 4 GB limit?
> Is the workaround only applicable to fetch? I.e., would uploading a file via `git lfs push` work correctly even with the Git 4 GB limit?
It probably won't work on Windows. Added files are passed through the LFS `clean` filter (basically the reverse of the `smudge` filter) and are probably subject to the same size limitation in Git.
I'm seeing this on Ubuntu running on a Windows machine that is dual-booted via GRUB. Could this have something to do with the partition rather than Windows itself? See attached photo.
Hi @ndebard -- thanks for commenting here. I think what you're experiencing is the correct behavior, even though the output can be a bit confusing. This message appears when a file greater than 4.0 GB is copied into your working tree (it looks like `data/dump_006_ls_0.dat` is 9.0 GB). The message exists to warn that, when checking out that same file on Windows, there may be issues with certain versions of Git.
Though the message isn't pertinent to you on Ubuntu (?), it will have relevancy for any colleagues of yours using Windows.
Hi @ttaylorr
I'm very new to using Git so I just want to gain further clarification on your response here. If I'm correct, what you're saying is that this "malformed smudge" output isn't an error message, but instead a warning to Windows users pertaining to their version of Git? Considering that I'm using Ubuntu, would it be safe to simply ignore the message?
> Considering that I'm using Ubuntu, would it be safe to simply ignore the message?
It is safe to ignore the message for yourself -- since you're not on Windows, your repository should work as expected even though it contains large files in the working copy. The warning is to remind you that a checkout of your repository _may not work_ on a Windows machine.
Could someone please clarify what this `smudge` filter does, and how turning it off makes Windows able to store bigger files?
Or is Git LFS not able to process large files over 4 GB at all? Would plain Git be able to store those files?
> Could someone please clarify what this `smudge` filter does
Hi @luckydonald, thanks for asking! The `smudge` filter is applied to transform content from your index into your working copy. In practice, this means that we take the small reference (pointer) that LFS actually stores in your Git repository, and transparently convert it into a large file on your disk, so that it appears as if the large file itself is present in the repository.
> and how turning it off makes Windows able to store bigger files?
I don't think that turning off the smudge filter would make Windows able to store bigger files. The issue is rather that Git has a limitation on Windows of not being able to correctly smudge files when the size of the outgoing content is larger than 4GB.
This isn't an inherent limit of the file system, but rather an implementation detail of Git.
So there's no viable workaround for this? Basically, if you have 4GB files in your repo, you can clone it if you disable the smudge filter, but you can't commit or push?
> Basically, if you have 4GB files in your repo, you can clone it if you disable the smudge filter, but you can't commit or push?
Not quite. The issue is with 4 GiB files from _any source_; their coming from Git LFS is only one half of the problem. If another filter puts them there, or that's how they're stored in your Git repository, then it is not guaranteed that they will be checked out correctly by Git on Windows.
With regards to the Git LFS part of that problem, if you have a >4 GiB LFS _object_ (read: not a Git object, but an LFS one), you can avoid introducing it into your local copy by passing `--exclude=path/to/file` (or setting `lfs.fetchexclude=path/to/file` in your `.gitconfig`). With either (or both) of these options, Git LFS will not download or check out the large file into your working copy, thus side-stepping the problem.
One thing that I think is important to remember is that this issue does _not_ cause problems on Unix, macOS, or other platforms that don't have the >4 GiB file-size limitation. So, if you have a >4 GiB file in your repository (LFS or otherwise), it should work fine on platforms other than Windows. If you're on Windows, we are stuck with this behavior, so `--exclude` is the best way forward, IMHO.
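For example, to skip one oversized object (hypothetical path):

```
$ git lfs fetch --exclude="path/to/huge-file.bin"
# or, persistently:
$ git config lfs.fetchexclude "path/to/huge-file.bin"
$ git lfs pull
```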
For the record, I have been using Git LFS on Windows by disabling the smudge filter and the process filter. Files >4GB seem to work fine. It just means you need to manually `git lfs pull` any time you pull, switch branches, or clone a repo.
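Roughly, that setup looks like this (a sketch, assuming a fresh clone; the URL is a placeholder):

```
$ git lfs install --skip-smudge               # configures the --skip variants of the filters
$ git clone https://example.com/big-repo.git  # checks out pointer files only
$ cd big-repo
$ git lfs pull                                # run manually after every pull/checkout/clone
```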
First: my colleagues and I encounter this problem on Windows 10 with the latest NTFS filesystem, which is without any doubt capable of handling files > 4 GB, and even files of sizes up to 16 exabytes (see http://www.ntfs.com/ntfs_vs_fat.htm).
Second: the newest Git (including the LFS feature) is great.
Third: thanks for all those proposals for avoiding the problem, but in the end we don't want to avoid using/downloading/cloning/pulling files > 4 GB. We want Git (LFS) to handle them correctly as LFS files.
Is there a plan and timeframe to fix this problem on Windows?
So as I understand this issue, it's due to Git on Windows not supporting files greater than 4 GB properly. The issue is that the smudge and clean filters are invoked by Git, and Git itself doesn't handle this gracefully. Git LFS does handle this gracefully, but because it's invoked by Git (unless you disable the filters), the data is corrupted before it makes it to Git LFS.
To explain the issue with Git: it's because the Git codebase uses `unsigned long` for certain values. On a 64-bit Unix system (including Linux and macOS), that type is 64 bits in length, and `unsigned long` is the canonical way to write a system-sized unsigned word type. However, on Windows, `unsigned long` is always 32 bits. Consequently, even a 64-bit Git on Windows doesn't handle large files. There's an explanation of this issue in a thread on the Git list.
Git for Windows is already tracking this issue as git-for-windows/git#1063. The good news is that when this is fixed in Git, everything should automatically work with any version of Git LFS. In the meantime, there isn't anything we as Git LFS developers can do to fix it.
@bk2204 That sounds correct to me. I just wanted to reiterate that the workaround from @technoweenie does work:
> As a workaround, I think you should disable smudging completely:
>
> ```
> $ git lfs install --skip-smudge
> $ git lfs env
> ... snip
> git config filter.lfs.process = "git-lfs filter-process --skip"
> git config filter.lfs.smudge = "git-lfs smudge --skip -- %f"
> ```
>
> After that, you'll have to run `git lfs pull` any time you change branches or fetch updates from your remote.
So it is possible to still work with Git LFS on Windows for large files, but you must disable `smudge` or your files will get corrupted.
> Git for Windows is already tracking this issue as git-for-windows/git#1063. The good news is that when this is fixed in Git, everything should automatically work with any version of Git LFS. In the meantime, there isn't anything we as Git LFS developers can do to fix it.
Thank you very much for clarifying the situation. I think it was necessary to state quite clearly what we "Big File Users" are waiting for :-)
@ttaylorr @technoweenie Unless I'm mistaken, the workaround of disabling smudge only sort of works. If you have a nice big new file upstream, and you `git pull`/`git lfs pull`, everything is great. But then if you do `git checkout HEAD~1` followed by `git checkout master`, the checkout of master will fail, as it will try to smudge the big file because it is already present in `.git/lfs/objects`.
To recover from this, you basically need to nuke your `.git/lfs` directory, as the presence of the file in the local LFS objects folder means smudge runs for it even though the skip options are set.
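In other words, the recovery looks roughly like this (destructive: every LFS object will be re-downloaded):

```
$ rm -rf .git/lfs
$ git checkout master
$ git lfs pull
```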
One important caveat: I'm on an older git/git-lfs version. I'll upgrade tomorrow, but just based on comparing the old and current source of git-lfs, I don't expect different behavior. If what I'm talking about sounds wacky/not the behavior you currently expect, perhaps it has already been fixed.
Hey. This issue should not be closed, as the Windows client still corrupts files > 4GB.
It just depends on another bug in Git.
> This issue should not be closed, as the Windows client still corrupts files > 4GB.
If I recall correctly, this is not related to a bug in Git, but rather is an inherent limitation of the Windows filesystem.
The issue as described in the Git for Windows bug mentioned earlier in this thread (git-for-windows/git#1063) points to a problem with incorrect datatypes in the Git code, not with Windows filesystem limitations:

> The problem is the Git source code, which uses `unsigned long` in places where `size_t` would be correct.
> If I recall correctly, this is not related to a bug in Git, but rather is an inherent limitation of the Windows filesystem.
The limitation on files larger than 4 GiB is for FAT, but not NTFS. NTFS is capable of large files, but you're correct that if you're using a flash drive for your LFS-using Git repository, then you probably have a file system limitation. I think that most Windows users are using NTFS for their systems, though.
I believe in this case the issue is as @shabbyrobe quoted: we use `unsigned long`, which is 32 bits on 64-bit Windows, instead of `size_t`, which is 64 bits. While Unix systems have `unsigned long` equal to the pointer size (they are LP64 systems), Windows always has a 32-bit `long` (the LLP64 model).
There are a handful of patches going into Git 2.21 to address some of these issues, although they may not be complete, so it may be useful to keep an eye out for improvements in that regard.
I wanted to add that my previous issue with the workaround is no longer present in the latest version of git/git-lfs. So the workaround @technoweenie described has been working great for me. We run a little lfs server (using the lfs-test-server, which is officially non-production, but pretty simple) and have several >4GB files in our repo.
FYI, the core issue in git-for-windows (https://github.com/git-for-windows/git/issues/1063) has been closed due to inactivity, but is not fixed. I wanted to try to help others watching this issue and the git-for-windows issue avoid being confused by the closed state change. Utilizing the workaround is probably the best way to deal with this issue for the foreseeable future.
Hello all,
would putting the `git lfs pull` workaround suggested by @technoweenie in a post-checkout hook work correctly, or would that skip some cases?
I'm inclined to say adding that to the `post-checkout` hook will likely work, but you may also need to add it to the `post-commit` and `post-merge` hooks to catch all cases (e.g. a merge brings in a change from another branch). Recent versions of Git LFS should honor any changes you make to those hooks unless you run `git lfs update --force`.
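A minimal sketch of what that could look like (the hook contents here are an illustration only; Git LFS installs its own hooks, so check what is already in `.git/hooks` before appending):

```
# Append a manual pull to the existing hooks (sketch):
$ echo 'git lfs pull' >> .git/hooks/post-checkout
$ echo 'git lfs pull' >> .git/hooks/post-commit
$ echo 'git lfs pull' >> .git/hooks/post-merge
$ chmod +x .git/hooks/post-checkout .git/hooks/post-commit .git/hooks/post-merge
```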
One 'hole' remains in the workaround/workflow of configuring `filter.lfs.process` and `filter.lfs.smudge` to be skipped and doing `git lfs pull`. As @technoweenie noted, the same limitation does apply to the clean filter, so if you commit and then push a change involving a >4GB file, that file will be malformed after the push. The push will even warn you and direct you to look at smudge's help.
Git will notice that the file is different than it should be if you do a `git status`. A `checkout` and subsequent `git lfs pull` will fix it; a sketch of that recovery is at the end of this comment. But it would be nice if we could configure `filter.lfs.clean` to be skipped as well; unfortunately it does not have a `--skip` parameter. Would such a parameter make sense, or are there cases that fall apart? If it did make sense, having it would make for a more robust workflow.
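A sketch of that recovery, assuming the affected file is `path/to/bigfile.bin` (a hypothetical path):

```
$ git status                           # shows the >4GB file as modified
$ git checkout -- path/to/bigfile.bin  # restore the pointer file
$ git lfs pull                         # re-download the real content
```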
I'm not sure it makes sense, since there's otherwise no way to add a file. `git lfs checkout` provides a way to check out a file without going through Git, but there's no way to add the LFS contents without doing a `git add`.
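For example (hypothetical path):

```
# Replace the pointer file in the working tree with the real content,
# without going through Git's filter machinery:
$ git lfs checkout path/to/bigfile.bin
```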
Ah ok, sorry - I misunderstood the process... I realize skipping `clean` doesn't make sense now. So when I edit an existing >4GB LFS file, commit, and then push, the end result is a corrupt file. Can you help me understand which low-level git-lfs commands run during that push? `clean` checks the file size, which seems weird to me since it is turning a file into a pointer, but maybe I'm missing something. I'm trying to understand which command results in the corrupt file so that I can try to avoid running that command, if that makes sense.
There are two possible processes: the `clean` filter (in the config as `filter.lfs.clean`) and the filter process (`filter.lfs.process`). Those correspond to `git lfs clean` and `git lfs filter-process`. The latter is used for efficient smudging as well, and is preferred over the actual smudge and clean filters if available.