Git-lfs: git archive doesn't include LFS files

Created on 23 Jun 2016 · 33 Comments · Source: git-lfs/git-lfs

  1. git archive any commit with files tracked by LFS

The archive contains the LFS pointer files instead of the files themselves.
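
To check whether an extracted archive is affected, one option is to scan it for pointer files. The following is a minimal sketch, not part of Git LFS: the function name `find_lfs_pointers` is hypothetical, and the heuristic (a file of at most 1 KiB whose first line is the LFS spec version line) is an assumption based on the pointer format shown later in this thread.

```shell
#!/bin/sh
# List files under a directory that look like Git LFS pointer files.
# Heuristic (assumption): a pointer is a small text file whose first line
# is exactly the LFS spec "version" line.
find_lfs_pointers() {
    dir=${1:-.}
    # -size -3 keeps files of at most two 512-byte blocks (1 KiB);
    # real pointer files are far smaller than that.
    find "$dir" -type f -size -3 | while IFS= read -r f; do
        if head -n 1 "$f" | grep -qxF 'version https://git-lfs.github.com/spec/v1'; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `find_lfs_pointers archive/` after unpacking prints one path per pointer file; empty output suggests the archive holds the real contents.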

enhancement

All 33 comments

I hoped that git archive would use the configured smudge filter, but apparently not. We'd have to modify git archive to support LFS, which probably won't happen since it's a core command, and LFS is an uncommon extension. I think implementing a git lfs archive command would be better.

Could you please clarify: is this issue closed as "Will not fix", or is there a solution already (or a separate ticket opened to track it)?

Are there any recommended workarounds? At this point in time, all I can think of is to manually copy LFS-enabled files from my working copy. That can't be the best way, can it?

+1 to this feature. Badly needed.

Any suggestions for a workaround? I depend on git-archive for an internal workflow. We now track multiple files in LFS and manual copying no longer works.

FWIW, answering my own question. I'm not terribly happy about this, but it works.

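      require "fileutils"

      # `git lfs ls-files` prints "<oid> <marker> <path>"; keep only the path.
      manual_files_to_copy = `git lfs ls-files | cut -d ' ' -f 3-`.split("\n")
      manual_files_to_copy.each do |f|
        # source_dir must point at a working copy holding the real LFS files.
        full_path_to_file = File.join(source_dir, f)
        puts "copying #{full_path_to_file} to #{f}"
        FileUtils.cp(full_path_to_file, f) # raises if the copy fails
      end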

We (in the Docker official images over in @docker-library) use git archive heavily to avoid even having a local checkout of the things we're generating a tarball for (we invoke git archive from directly inside a bare repository with no working copy), so Git LFS not being able to support that is a major downer. :disappointed:

+1, I store pre-processed machine learning data sets using git lfs and when I need to train the model on a cloud machine, it'd be way faster for me to git lfs archive vs git lfs clone

Here's a script that gets the job done. Not the cleanest thing (it creates a repo to pull down the file), but it works.

#!/bin/bash

# Do a git-archive and put files in CWD
# If LFS pointer file, download actual file

#set -x

if [ $# -lt 2 ]; then
    echo "Usage: $0 <remote> <tree-ish> [<path>...]"
    exit 1
fi

REMOTE=$1
TREEISH=$2
shift 2

# Create local repo in tmp directory to download LFS files

DIR=.tmp$$
mkdir $DIR
cd $DIR

git init                      > /dev/null
git remote add origin $REMOTE > /dev/null

git archive --remote=$REMOTE $TREEISH $* | tar xvf -
if [ $? -ne 0 ]; then
    echo "git-archive|tar failed"
    cd ..
    rm -rf $DIR
    exit 1
fi

MATCHES=$(grep --files-with-matches --recursive "oid sha256" .)
for FILE in $MATCHES; do
    # Echo the pointer text into smudge to download the file
    # Permissions preserved by redirecting into the original file
    POINTER=$(cat "$FILE")
    echo "$POINTER" | git lfs smudge "$FILE" > "$FILE"
done

rm -rf .git

rsync -a ./ ..

cd ..
rm -rf $DIR

exit 0

+1 from my side.

The zips downloaded from the web view, as well as the zips attached to releases, are a form of deployment for non-Git users. Any workaround that needs a Git client on the user's side is therefore effectively useless.

In any case, zip files of the full repo that do not contain the full repo are (in most cases) useless, especially as the user is not aware of what's missing. It's opaque, unexpected behaviour for the user.

So from my point of view, the default behaviour must be to include all files, no matter where they come from.

Can we re-open this?

@jjgod I think that I would prefer to leave this issue closed. I believe (@larsxschneider, please correct me if I am wrong) that changing the behavior of git archive would require an upstream change in Git, in which case I think that the best place to have this discussion would be the Git mailing list.

If the issue is referring to the "download ZIP" functionality of GitHub.com, that request should be sent to GitHub Support, since this repository does not pertain to GitHub.com, only Git LFS.

No, I'm referring to having a git lfs archive command. Does it exist now?

My misunderstanding! Thank you for correcting me. This is not currently supported by Git LFS, but I think that it is a great idea to add. I think that the release for 2.5.0 is pretty full already, so I am not sure which release will be the first to include this functionality, but I will re-open this accordingly nonetheless.

I would like the "zip" button on GitHub to behave like that, but that functionality should be available in Git directly, to be consistent.

I would prefer that 'git archive' did that automatically, for the sake of transparency to the user.
In large multi-team projects, users do not necessarily know that LFS is used for some binaries located somewhere.

I understand that you might not want this, for backwards-compatibility and architectural reasons.
But at least I would expect git archive to detect an LFS repo, issue a warning about missing binaries, and recommend using git lfs archive.

GitHub could then simply use git lfs archive if an LFS archive is detected.

Hi, @Wildeast -- thanks for the feedback.

> I would like the "zip" button on GitHub to behave like that, but that functionality should be available in Git directly, to be consistent.

Since this is functionality pertaining to GitHub (and this repository pertains only to Git LFS), I think that this request could be sent to GitHub Support instead.

> I would prefer that 'git archive' did that automatically, for the sake of transparency to the user.
> In large multi-team projects, users do not necessarily know that LFS is used for some binaries located somewhere.

Agreed, though I don't think that this is possible in Git, as it does not have explicit knowledge of Git LFS, nor does it provide hooks to modify the contents of a repository.

(cc @larsxschneider for the following)

As a short-term solution, I would like to teach Git LFS a new command, git-lfs-archive(1), and implement it according to the discussion above.

In the long-term, I wonder about the feasibility of teaching git-archive(1) how to either (a) respect index/worktree filters, or (b) add additional hooks to git-archive(1) in order to modify the contents of an archive before committing it.

@Wildeast @jjgod Can you elaborate a bit more on what is not working for you, or provide a test case? git archive generates an archive with LFS files on my machine with Git 2.16.2:

# create a Git LFS repo
$ git init .
Initialized empty Git repository in /tmp/demo/.git/

$ git lfs track '*.bin'
Tracking "*.bin"

$ echo "data" > foo.bin

$ git add .

$ git commit -m "bar"
[master (root-commit) f29d9fe] bar
 2 files changed, 4 insertions(+)
 create mode 100644 .gitattributes
 create mode 100644 foo.bin

$ git cat-file -p HEAD:foo.bin
version https://git-lfs.github.com/spec/v1
oid sha256:6667b2d1aab6a00caa5aee5af8ad9f1465e567abf1c209d15727d57b3e8f6e5f
size 5

$ cat foo.bin
data

# archive the repo
$ git archive -o latest.zip HEAD

# inspect the archived repo
$ unzip latest.zip -d archive
Archive:  latest.zip
f29d9fe8fcbf7beade9a0c3602fd3890eb709dc5
  inflating: archive/.gitattributes
 extracting: archive/foo.bin

# it works! :-)
$ cat archive/foo.bin
data

> git archive generates an archive with LFS files on my machine with Git 2.16.2:

@larsxschneider 😱 -- does git-archive(1) take into account the working copy? If so, we could implement a wrapper command in LFS via the following:

  1. Check out LFS files in the entire repository (respecting --include, --exclude).
  2. Run git-archive(1), passing down any options that were given to the proposed git-lfs-archive(1).

Yeah, I believe it does now, but that doesn't work for git archive from a --bare clone. :disappointed: (So having an explicit git lfs archive subcommand to be able to take a tarball from a bare git archive and "upgrade" it to include LFS bits -- or at least a way to fetch the LFS bits would be awesome. I've got additional details from the last time I tested this in https://github.com/docker-library/official-images/issues/1095#issuecomment-256481338.)

The command I have been testing is:

git archive --format=tar --remote=<git url> <tag>

so it doesn't have a working copy.

I see, so the <tag> argument causes git-archive(1) to produce a tarball that _doesn't_ contain smudged LFS contents?

Yes, exactly.

This bug likely affects the "Download ZIP" button on GitHub. That button doesn't include LFS files in the download.

IMHO, the fact that tarballs downloaded from GitHub do not include LFS files violates the "principle of least surprise".

As a downstream maintainer of software in Linux distros, I have no idea that an upstream project uses LFS until I download the new source tarball and it fails to build due to "x is not a valid y file". It's not immediately obvious why builds fail until I start inspecting the contents of the file. It's not at all obvious that a repo contains LFS-stored files when I am downloading the tarball, because I am using wget without looking at the GitHub site directly.

Why wouldn't you want source tarballs downloaded from a project's tagged release to include all the source necessary to rebuild the project?

Just to clarify, without a working copy, git archive doesn't include LFS files or any other filtered files. This is a limitation in upstream Git, and it translates over to the GitHub releases, since they currently use that same mechanism. This also affects things like working tree encodings, which are common on Windows; CRLF handling; and anything else using this functionality.

I understand this is frustrating and surprising for a lot of people. It _is_ something we want to fix, but we haven't gotten to it yet.

Since the GitHub-provided archives aren't guaranteed to be stable (that is, they will contain the same contents but need not be bit-for-bit identical), many developers choose to produce their own archives so they can digitally sign them or provide file hashes, which is also a suitable workaround for this issue.
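
A minimal sketch of that workaround, under a few assumptions: it is run inside a regular (non-bare) clone where Git LFS is installed, `sha256sum` from GNU coreutils is available, and the function name `make_release_archive` is purely illustrative.

```shell
#!/bin/sh
# Build a release tarball from a local clone and record its SHA-256 checksum.
make_release_archive() {
    ref=$1   # tag or commit to archive
    out=$2   # output file name, e.g. project-1.0.tar.gz
    git archive --format=tar.gz -o "$out" "$ref" || return 1
    # Store the checksum next to the archive so it can be published with it.
    sha256sum "$out" > "$out.sha256"
}
```

For example, `make_release_archive v1.0 project-1.0.tar.gz` writes the tarball plus `project-1.0.tar.gz.sha256`, which users can verify with `sha256sum -c`. The tag and file names here are placeholders.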

@bk2204 a simple fix would be a setting to remove the GitHub-provided archives on a per-repo basis. I would turn that on for ICU and then @mbooth101 would not have the unpleasant surprise (sorry!)

Thanks for the reply @bk2204!

Do you know if there is anything externally tracking the feature request to remove the automatic "Source code" download links that GitHub adds to a Release or Tag?

No, but this wouldn't be the place that such a request would live. This repository offers support only for the Git LFS client, so any GitHub-specific feature requests should be sent to GitHub Support instead.

This is very confusing. This bug is three years old, was closed, and was opened again, and it is the top hit on Google for "git lfs archive". I cannot tell whether or not there is a fix for using "git archive" when LFS is in use (this is a very basic git command, and my build scripts depend on it!).

If you have Git LFS enabled (i.e., the filter rules are properly set up via git lfs install), a recent version of git archive will include the LFS files in it, even in a bare repository. It can be slow, because Git uses the smudge filter instead of the filter process, but that would be a Git issue to be addressed with upstream.
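
One way to check the precondition mentioned above before archiving is to look for the filter settings that `git lfs install` writes to the Git configuration. The sketch below is an illustration only: the function name `lfs_filters_configured` is hypothetical, and it merely reads configuration, so it does not prove that the git-lfs binary is on PATH or that the LFS objects are reachable.

```shell
#!/bin/sh
# Succeed only if both LFS filter rules are present in the Git configuration.
lfs_filters_configured() {
    git config --get filter.lfs.smudge >/dev/null &&
    git config --get filter.lfs.clean >/dev/null
}
```

A build script could then guard the archive step, for example `lfs_filters_configured && git archive -o latest.zip HEAD`, and otherwise ask the user to run `git lfs install` first.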

Currently, the archives on github.com do not, which is where the confusion lies. That is separate and not related to this project; issues with that should be addressed to GitHub Support.

Since this is actually working with git archive, I'm going to close this issue to avoid additional confusion.

Okay, I see. What about the requests for a "git lfs archive" which do not have these limitations though?

As a philosophical matter: I don't think I actually recognize these distinctions you are making between "git", "git-lfs", and "github" as if these were three different entities. The official git-lfs web page specifically identifies git-lfs as a Github project (I don't know if that's true, but it's what your webpage says). Git is not a Github project, but many of us are using git because Github directed us to in order to upload files to Github. From my perspective Github is offering us a feature (large file hosting) which is seriously broken around archival, and the explanation for why parts of it are broken is to point to some arcana about the difference between a "filter" and a "smudge" within the git internals, which is not something that matters to us as end users. From my perspective it doesn't matter how Github's chosen command line tool is internally architected, what matters to me is whether Github's feature overall works.

I don't think we're going to reimplement the functionality in core Git. The performance will be fine unless you're using a bare repository or archiving a revision which you haven't ever checked out, since you'll already have the LFS objects on your local system. The slowness only comes when downloading objects from the remote. We try very hard not to reimplement core Git's functionality, because it has to be maintained long term and it tends to diverge from Git and lead to a worse experience.

I appreciate that folks don't want to draw a distinction between Git, Git LFS, and GitHub, but we try to direct people to the right place because different folks maintain each. If you tell this project about problems in Git, since there are three core maintainers for this project and none of us work on it full time, it will take us a long time to work with upstream about your problem. If you talk to upstream directly, you stand a better chance of someone addressing your problem.

Similarly, GitHub has different folks maintaining the archive support from those working on this project. We can't just magically fix the issue, even though we'd really like to. Telling us about the problem doesn't help it get solved; the folks who can fix the problem are already aware of it and know it's a problem, and haven't solved it because it's nontrivial.

We generally direct issues related to hosting to hosting providers because there are a lot of people using lots of different hosts (GitHub, GitLab, Bitbucket, etc.) and it's not always obvious what the solution is to a problem and the support channels for those providers often have a much better understanding of potential solutions and tracking long-term work. For example, someone the other day had an authentication issue that I didn't know how to solve, but I was sure that provider's support folks would know how to solve it.

Finally, even though this is a GitHub-sponsored open source project, it's still an open source project, and helping people with problems is a significant portion of our time spent. Time we spend dealing with issues that are not within the scope of the Git LFS client software means less time we spend on fixing and improving the Git LFS client itself. I know it's frustrating when we say that, but it really is true.
