When cloning a repository containing a file larger than 4 GB from a Bitbucket Server repository (LFS enabled), the file is not reconstructed correctly; e.g. a 6 GB file results in a 700 MB file. The lfs/objects folder contains the correct file, however.
Steps to reproduce:
Server: Basic install of Debian with Bitbucket Server 4.6, Git 2.13 (64-bit)
Client 1: Ubuntu 16.04 (64-bit), Git 2.13 and Git LFS 2.2.1 (both 64-bit)
Client 2: Windows Server 2012 (64-bit), Git 2.13 and Git LFS 2.2.1 (both 64-bit)
Client 1 works correctly.
Client 2 pulls down the file correctly into `.git/lfs/objects/aa/6d/aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049` but does not reconstruct/copy it correctly to the destination folder (it results in a 1.82 GB file).
Atlassian have looked into the problem and believe that the Bitbucket server is working correctly, since the correct content is retrieved over the network into the temporary object file (the CRCs match the original file).
Note that no Git configuration has been changed (smudge filters etc. are the default).
If the file is removed and `git lfs pull` is performed, the file is created correctly. Using `git lfs clone` also works.
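For reference, the recovery steps look like this (a sketch; `path/to/bigfile.bin` stands in for the affected file):

```
$ rm path/to/bigfile.bin
$ git lfs pull    # re-creates the file correctly
```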
First, you can confirm that Bitbucket is sending the data correctly by checking the file in `.git/lfs/objects`. If this checks out, the server is doing things correctly:

```
$ cd .git/lfs/objects/aa/6d
$ shasum -a 256 aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049
aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049 aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049
```
Next, the fact that it works with `git lfs clone` and `git lfs pull` means the Git LFS code for copying files works. This ultimately comes down to `lfs.LinkOrCopy()`, which either makes a hard link (if `.git/lfs/objects` is on the same partition as your working dir) or manually writes the bytes to the new location.
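As a quick way to see which path `lfs.LinkOrCopy()` will take, you can compare the filesystems of the object store and the working directory (a sketch; if the `Filesystem` column differs, LFS must copy the bytes instead of hard-linking):

```
$ df .git/lfs/objects .
```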
So, this leaves the Git filters. There are two modes that could be causing problems:

- `process` mode is the default in your reported Git and LFS versions. The `process` mode uses a protocol to receive smudge requests from Git via STDIN, and to deliver object contents from LFS via STDOUT.
- `smudge` mode is basically a request from Git to get the contents for a single file.

It may be worth triggering the smudge filter directly to see if it does the same thing:

```
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://git-server/your/repo
$ cd repo
$ cat path/to/file
version https://git-lfs.github.com/spec/v1
oid sha256:98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
size 3
$ cat path/to/file | git lfs smudge > smudged-file.bin
$ shasum -a 256 smudged-file.bin
98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 smudged-file.bin
```
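For reference, you can see which mode Git will use by listing the LFS filter configuration; when `filter.lfs.process` is set, Git prefers it over the single-shot smudge/clean filters. A default `git lfs install` typically produces:

```
$ git config --get-regexp '^filter\.lfs\.'
filter.lfs.clean git-lfs clean -- %f
filter.lfs.smudge git-lfs smudge -- %f
filter.lfs.process git-lfs filter-process
filter.lfs.required true
```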
Based on that, I think one of the following could be happening:

- Git LFS isn't writing the object contents to `STDOUT` correctly. Maybe there are some special characters in your files that are stripped or something?

You could try debugging with `GIT_TRACE_PACKET=1 git clone` or something, but the output would be _massive_ for your files :)

Some questions:
It'd be really helpful if we could get a sample file that exhibits this behavior. I imagine that's a no-go, so we may have to come up with a special build of LFS with special tracing powers. @ttaylorr, any thoughts? Did I miss any debugging questions or trial commands to run?
@technoweenie that looks pretty comprehensive. My hunch is that it's related to one of the three issues you described as being process filter-related.
@obe1line do you have a copy of the file or repository that you could share? I think that would be the easiest way for me to debug this going forward.
Hey @obe1line, I ran this by a Git core dev, and he mentioned that Git on Windows does not support files over 4GB. Unfortunately, this is not something we can fix in LFS. The best we can do now is add a warning when large objects are added.
As a workaround, I think you should disable smudging completely:
```
$ git lfs install --skip-smudge
$ git lfs env
... snip
git config filter.lfs.process = "git-lfs filter-process --skip"
git config filter.lfs.smudge = "git-lfs smudge --skip -- %f"
```
After that, you'll have to run `git lfs pull` any time you change branches or fetch updates from your remote.
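With that configuration, a typical update cycle looks like this (a sketch; the branch name is hypothetical):

```
$ git fetch origin
$ git checkout feature-branch   # files appear as small pointer files
$ git lfs pull                  # replaces the pointers with the real content
```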
@ttaylorr I think we should add an early warning in the filter smudge and process code, perhaps linking to a page offering this workaround.
@technoweenie Thanks for the detailed comment.
In answer to your questions:
I used `GIT_TRACE_PACKET`, `GIT_TRACE`, and `GIT_CURL_VERBOSE` to produce output previously and yes, it is rather large (~12 GB from memory).
@ttaylorr It can be reproduced with any file >4 GB; nothing special about the repository. The file is pushed with the Linux client and fetched with the Windows client - I assume that if I had used Windows to push, the file might not have transferred fully into the repository.
@technoweenie Is the workaround only applicable to fetch? I.e., would uploading a file via `git lfs push` work correctly even with the Git 4 GB limit?
> Is the workaround only applicable to fetch? I.e., would uploading a file via `git lfs push` work correctly even with the Git 4 GB limit?
It probably won't work on Windows. Added files are passed through the LFS `clean` filter (basically the reverse of the `smudge` filter) and are probably subject to the same size limitation in Git.
I'm seeing this on Ubuntu running on a Windows machine that is dual-booted via GRUB. Could this have something to do with the partition rather than Windows itself? See attached photo.
Hi @ndebard -- thanks for commenting here. I think what you're experiencing is the correct behavior, even though the output can be a bit confusing. This message appears when a file greater than 4.0 GB is copied into your working tree (it looks like `data/dump_006_ls_0.dat` is 9.0 GB). The message exists to warn that, when checking out that same file on Windows, there may be issues with certain versions of Git.
Though the message isn't pertinent to you on Ubuntu (?), it will have relevancy for any colleagues of yours using Windows.
Hi @ttaylorr
I'm very new to using Git so I just want to gain further clarification on your response here. If I'm correct, what you're saying is that this "malformed smudge" output isn't an error message, but instead a warning to Windows users pertaining to their version of Git? Considering that I'm using Ubuntu, would it be safe to simply ignore the message?
> Considering that I'm using Ubuntu, would it be safe to simply ignore the message?
It is safe to ignore the message for yourself -- since you're not on Windows, your repository should work as expected even though it contains large files in the working copy. The warning is to remind you that a checkout of your repository _may not work_ on a Windows machine.
Could someone please clarify what this `smudge` filter does, and how turning it off makes Windows able to store bigger files?
Or is Git LFS not able to process large files over 4 GB at all? Would plain Git be able to store those files?
> Could someone please clarify what this `smudge` filter does
Hi @luckydonald, thanks for asking! The `smudge` filter is applied to transform content from your index into your working copy. In practice, this means that we take the small reference (pointer) that LFS actually stores in your Git repository, and transparently convert it into a large file on your disk, so that it appears as if the large file itself is present in the repository.
> and how turning it off makes Windows able to store bigger files?
I don't think that turning off the smudge filter would make Windows able to store bigger files. The issue is rather that Git has a limitation on Windows of not being able to correctly smudge files when the size of the outgoing content is larger than 4GB.
This isn't an inherent limit of the file system, but rather an implementation detail of Git.
So there's no viable workaround for this? Basically, if you have 4GB files in your repo, you can clone it if you disable the smudge filter, but you can't commit or push?
> Basically, if you have 4GB files in your repo, you can clone it if you disable the smudge filter, but you can't commit or push?
Not quite. The issue is with 4 GiB files from _any source_; their coming from Git LFS is only one half of the problem. If another filter puts them there, or that's how they're stored in your Git repository, then it is not guaranteed that they will be checked out correctly by Git on Windows.
With regards to the Git LFS part of that problem, if you have a >4 GiB LFS _object_ (read: not a Git object, but an LFS one), you can avoid introducing it into your local copy by passing `--exclude=path/to/file` (or setting `lfs.fetchexclude=path/to/file` in your `.gitconfig`). With either (or both) of these options, Git LFS will not download or check out the large file into your working copy, thus side-stepping the problem.
One thing that I think is important to remember is that this issue does _not_ cause problems on Unix, macOS, or other platforms that don't have the >4 GiB file-size limitation. So, if you have a >4 GiB file in your repository (LFS or otherwise), it should work fine on platforms other than Windows. If you're on Windows, we are stuck with this behavior, so `--exclude` is the best way forward, IMHO.
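For example, to skip one oversized object (hypothetical path):

```
$ git lfs fetch --exclude="path/to/huge-file.bin"
# or, persistently:
$ git config lfs.fetchexclude "path/to/huge-file.bin"
$ git lfs pull
```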
For the record, I have been using Git LFS on Windows by disabling the smudge filter and the process filter. Files >4GB seem to work fine. It just means you need to manually `git lfs pull` any time you pull, switch branches, or clone a repo.
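Roughly, that setup looks like this (a sketch, assuming a fresh clone; the URL is a placeholder):

```
$ git lfs install --skip-smudge               # configures the --skip variants of the filters
$ git clone https://example.com/big-repo.git  # checks out pointer files only
$ cd big-repo
$ git lfs pull                                # run manually after every pull/checkout/clone
```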
First: my colleagues and I encounter this problem on Windows 10 with the latest NTFS filesystem, which is without any doubt capable of handling files > 4 GB, and even files of sizes up to 16 exabytes (see http://www.ntfs.com/ntfs_vs_fat.htm).
Second: the newest Git (including the LFS feature) is great.
Third: thanks for all those proposals for avoiding the problem, but in the end we don't want to avoid using/downloading/cloning/pulling files > 4 GB. We want Git (LFS) to handle them correctly as LFS files.
Is there a plan and timeframe to fix this problem on Windows?
So as I understand this issue, it's due to Git on Windows not supporting files greater than 4 GB properly. The issue is that the smudge and clean filters are invoked by Git, and Git itself doesn't handle this gracefully. Git LFS does handle this gracefully, but because it's invoked by Git (unless you disable the filters), the data is corrupted before it makes it to Git LFS.
To explain the issue with Git: it's because the Git codebase uses `unsigned long` for certain values. On a 64-bit Unix system (including Linux and macOS), that type is 64 bits in length, and `unsigned long` is the canonical way to write a system-sized unsigned word type. However, on Windows, `unsigned long` is always 32 bits. Consequently, even a 64-bit Git on Windows doesn't handle large files. There's an explanation of this issue in a thread on the Git list.
Git for Windows is already tracking this issue as git-for-windows/git#1063. The good news is that when this is fixed in Git, everything should automatically work with any version of Git LFS. In the meantime, there isn't anything we as Git LFS developers can do to fix it.
@bk2204 That sounds correct to me. I just wanted to reiterate that the workaround from @technoweenie does work:
> As a workaround, I think you should disable smudging completely:
>
> ```
> $ git lfs install --skip-smudge
> $ git lfs env
> ... snip
> git config filter.lfs.process = "git-lfs filter-process --skip"
> git config filter.lfs.smudge = "git-lfs smudge --skip -- %f"
> ```
>
> After that, you'll have to run `git lfs pull` any time you change branches or fetch updates from your remote.
So it is possible to still work with Git LFS on Windows for large files, but you must disable `smudge` or your files will get corrupted.
> Git for Windows is already tracking this issue as git-for-windows/git#1063. The good news is that when this is fixed in Git, everything should automatically work with any version of Git LFS. In the meantime, there isn't anything we as Git LFS developers can do to fix it.
Thank you very much for clarifying the situation. I think it was necessary to state quite clearly what we "Big File Users" are waiting for :-)
@ttaylorr @technoweenie Unless I'm mistaken, the workaround of disabling smudge only sort of works. If you have a nice big new file upstream, and you `git pull`/`git lfs pull`, everything is great. But then if you do `git checkout HEAD~1` followed by `git checkout master`, the checkout of master will fail, as it will try to smudge the big file because it is already present in `.git/lfs/objects`.
To recover from this, you basically need to nuke your `.git/lfs` directory, as the presence of the file in the local LFS objects folder means smudge runs for it even though the skip options are set.
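In other words, the recovery looks roughly like this (destructive: every LFS object will be re-downloaded):

```
$ rm -rf .git/lfs
$ git checkout master
$ git lfs pull
```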
One important caveat: I'm on an older git/git-lfs version. I'll upgrade tomorrow, but just based on comparing the old and current source of git-lfs, I don't expect different behavior. If what I'm talking about sounds wacky/not the behavior you currently expect, perhaps it has already been fixed.
Hey. This issue should not be closed, as the Windows client still corrupts files > 4GB.
It just depends on another bug in Git.
> This issue should not be closed, as the Windows client still corrupts files > 4GB.
If I recall correctly, this is not related to a bug in Git, but rather is an inherent limitation of the Windows filesystem.
The issue as described in the Git for Windows bug mentioned earlier in this thread (git-for-windows/git#1063) points to a problem with incorrect datatypes in the Git code, not with Windows filesystem limitations:

> The problem is the Git source code, which uses `unsigned long` in places where `size_t` would be correct.
> If I recall correctly, this is not related to a bug in Git, but rather is an inherent limitation of the Windows filesystem.
The limitation on files larger than 4 GiB is for FAT, but not NTFS. NTFS is capable of large files, but you're correct that if you're using a flash drive for your LFS-using Git repository, then you probably have a file system limitation. I think that most Windows users are using NTFS for their systems, though.
I believe in this case the issue is as @shabbyrobe quoted: we use `unsigned long`, which is 32 bits on 64-bit Windows, instead of `size_t`, which is 64 bits. While Unix systems have `unsigned long` equal to the pointer size (they are LP64 systems), Windows always has a 32-bit `long` (the LLP64 model).
There are a handful of patches going into Git 2.21 to address some of these issues, although they may not be complete, so it may be useful to keep an eye out for improvements in that regard.
I wanted to add that my previous issue with the workaround is no longer present in the latest version of git/git-lfs. So the workaround @technoweenie described has been working great for me. We run a little lfs server (using the lfs-test-server, which is officially non-production, but pretty simple) and have several >4GB files in our repo.
FYI, the core issue in git-for-windows (https://github.com/git-for-windows/git/issues/1063) has been closed due to inactivity, but is not fixed. I wanted to try to help others watching this issue and the git-for-windows issue avoid being confused by the closed state change. Utilizing the workaround is probably the best way to deal with this issue for the foreseeable future.
Hello all,
would putting the `git lfs pull` workaround suggested by @technoweenie in a post-checkout hook work correctly, or would that skip some cases?
I'm inclined to say adding that to the `post-checkout` hook will likely work, but you may also need to add it to the `post-commit` and `post-merge` hooks to catch all cases (e.g. a merge brings in a change from another branch). Recent versions of Git LFS should honor any changes you make to those hooks unless you run `git lfs update --force`.
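A minimal sketch of what that could look like (the hook contents here are an illustration only; Git LFS installs its own hooks, so check what is already in `.git/hooks` before appending):

```
# Append a manual pull to the existing hooks (sketch):
$ echo 'git lfs pull' >> .git/hooks/post-checkout
$ echo 'git lfs pull' >> .git/hooks/post-commit
$ echo 'git lfs pull' >> .git/hooks/post-merge
$ chmod +x .git/hooks/post-checkout .git/hooks/post-commit .git/hooks/post-merge
```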
One 'hole' remains in the workaround/workflow of configuring `filter.lfs.process` and `filter.lfs.smudge` to be skipped and doing `git lfs pull`. As @technoweenie noted, the same limitation does apply to the clean filter, so if you commit and then push a change involving a >4GB file, that file will be malformed after the push. The push will even warn you and direct you to look at smudge's help.
Git will notice that the file is different than it should be if you do a `git status`. A `checkout` and subsequent `git lfs pull` will fix it; a sketch of that recovery is at the end of this comment. But it would be nice if we could configure `filter.lfs.clean` to be skipped as well; unfortunately it does not have a `--skip` parameter. Would such a parameter make sense, or are there cases that fall apart? If it did make sense, having it would make for a more robust workflow.
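A sketch of that recovery, assuming the affected file is `path/to/bigfile.bin` (a hypothetical path):

```
$ git status                           # shows the >4GB file as modified
$ git checkout -- path/to/bigfile.bin  # restore the pointer file
$ git lfs pull                         # re-download the real content
```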
I'm not sure it makes sense, since there's otherwise no way to add a file. `git lfs checkout` provides a way to check out a file without going through Git, but there's no way to add the LFS contents without doing a `git add`.
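For example (hypothetical path):

```
# Replace the pointer file in the working tree with the real content,
# without going through Git's filter machinery:
$ git lfs checkout path/to/bigfile.bin
```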
Ah ok, sorry - I misunderstood the process... I realize skipping `clean` doesn't make sense now. So when I edit an existing >4GB LFS file, commit, and then push, the end result is a corrupt file. Can you help me understand which low-level git-lfs commands run during that push? `clean` checks the file size, which seems weird to me since it is turning a file into a pointer, but maybe I'm missing something. I'm trying to understand which command results in the corrupt file so that I can try to avoid running that command, if that makes sense.
There are two possible processes: the `clean` filter (in the config as `filter.lfs.clean`) and the filter process (`filter.lfs.process`). Those correspond to `git lfs clean` and `git lfs filter-process`. The latter is used for efficient smudging as well, and is preferred over the actual smudge and clean filters if available.