Git-lfs: Best practice for migrating an existing Git repo to support LFS?

Created on 22 May 2015 · 33Comments · Source: git-lfs/git-lfs

I'd like to give Git LFS a good crack with our existing ~25GB Git repo but I'm at a bit of a loss as to how we should go about migrating it to support LFS.
I'm happy to loose all history of the existing large files I'd like to track with LFS.

Do I need to write a script to find and purge all the filetypes I want to track with LFS from the existing Git history, untrack them, then retrack them with LFS?

Source

strich

Most helpful comment

I'm HOPING there is a better way to do this... BUT after ~~redacted~~ units of time, I found a way where you will NOT lose any history of existing large files. It is probably SLOW but it works.

DISCLAIMER I'm just a stubborn user, can't promise this will work out for you, BACKUP!

I have run into a similar situation myself, but in my case, all my large files were sitting in submodules. So I was able to remove the submodules, track files, and add them back (without the submodules, HORRAY!) This way I can say I did maintain all my history, only some of the history is in the submodules. I'm guessing this is not similar to your situation, and that everything is in one big repo.

1) Filter-branch to convert all tracking over to lfs

git filter-branch --prune-empty --tree-filter '
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git lfs track "*.npy"
git add .gitattributes .gitconfig

for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
  echo "Processing ${file}"

  git rm -f --cached ${file}
  echo "Adding $file lfs style"
  git add ${file}
done' --tag-name-filter cat -- --all

You have to use tree-filter here, and not index-filter because we are adding objects back.
I just added track *.npy as an example, but add all of your track commands in there. Hopefully you can use the SAME set of rules for EVERY commit
The .gitconfig files and git config lines are so that if you have an lfs server in a separate location than the git repo, everything works. IF you have them at the same url, you can skip that step

2) Push changes to whatever remote or remotes you have, of course you have to use the -f option, which can (and WILL!) have many implications to all the other users using the repo. Make sure no one else pushes or references the commits you just rebased, or else you will have a mess. This is when all the large files are sent to the lfs server.

git push -f origin master

(Optional) Collect garbage to shrink currently checked out repo

rm .git/refs/original -rf
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc

The rm command remove a "backup copy" of the the original refs before the filter-branch command, just to CYA. but when you are done with them, WIPE EM!
If this doesn't do enough gc, check out http://stackoverflow.com/a/14728706

(Optional) Collect garbage on a bare repo (on the remotes)

git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc

I hope this helps!

Untested ideas
Instead Step 1, just add

git lfs track "*.foo"
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git add .gitattributes .gitconfig

inside the for loop on step two. It should save that wasted initial rebase, and prevent the need for the git add .

Tested using git-lfs 0.5.1 and git 1.9.4.msysgit.1 (Yes, on windows 64 bit)

andyneff on 23 May 2015

👍29 ❤3 🎉1

All 33 comments

I'm HOPING there is a better way to do this... BUT after ~~redacted~~ units of time, I found a way where you will NOT lose any history of existing large files. It is probably SLOW but it works.

DISCLAIMER I'm just a stubborn user, can't promise this will work out for you, BACKUP!

1) Filter-branch to convert all tracking over to lfs

git filter-branch --prune-empty --tree-filter '
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git lfs track "*.npy"
git add .gitattributes .gitconfig

for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
  echo "Processing ${file}"

  git rm -f --cached ${file}
  echo "Adding $file lfs style"
  git add ${file}
done' --tag-name-filter cat -- --all

You have to use tree-filter here, and not index-filter because we are adding objects back.
I just added track *.npy as an example, but add all of your track commands in there. Hopefully you can use the SAME set of rules for EVERY commit
The .gitconfig files and git config lines are so that if you have an lfs server in a separate location than the git repo, everything works. IF you have them at the same url, you can skip that step

git push -f origin master

(Optional) Collect garbage to shrink currently checked out repo

rm .git/refs/original -rf
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc

The rm command remove a "backup copy" of the the original refs before the filter-branch command, just to CYA. but when you are done with them, WIPE EM!
If this doesn't do enough gc, check out http://stackoverflow.com/a/14728706

(Optional) Collect garbage on a bare repo (on the remotes)

git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc

I hope this helps!

Untested ideas
Instead Step 1, just add

git lfs track "*.foo"
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git add .gitattributes .gitconfig

inside the for loop on step two. It should save that wasted initial rebase, and prevent the need for the git add .

Tested using git-lfs 0.5.1 and git 1.9.4.msysgit.1 (Yes, on windows 64 bit)

andyneff on 23 May 2015

👍29 ❤3 🎉1

bfg is much more efficient than git-filter-branch, especially for long histories.
[https://rtyley.github.io/bfg-repo-cleaner/]

Here's roughly what I did to convert all my .mov files to lfs:

cp *.mov (and a few other large blob types) ~/tmp
git rm *.mov
git commit
git lfs track *.mov
git add .gitattributes
git commit; git push

In a fresh directory:

git clone --mirror $remote; cd repo
bfg --delete-files '*.mov'
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

Back to my src directory:

mv repo repo.bloated
git clone $remote; cd repo
cp ~/tmp/*.mov .
git add *.mov (it now puts them in lfs)
git commit; git push

Kind of a chore to figure out, but now my repo is small and zippy.
Tip: do this on a cloud machine instead of your laptop, since most of the time is pulling/pushing data to github.

tlbtlbtlb on 24 May 2015

👍23

@tlbtlbtlb in your solution, when you check out all the previous versions of your history, are all the .mov files correctly there, or are the missing everywhere except the newest commits?

andyneff on 24 May 2015

They're gone from previous versions. Which is what's necessary to cut the size of the repository.

tlbtlbtlb on 24 May 2015

I guess I was being stubborn and trying to give anyone an option where they DON'T loose the history of all large files, even though it was stated he'd be happy without them.

I guess the slower build-in equivalent to bfg --delete-files '*.mov' would be

git filter-branch --prune-empty --index-filter 'git rm --ignore-unmatch --cached "*.mov"'

andyneff on 24 May 2015

Although I'm happy to loose the history of the large files, I would need them to continue to exist in previous commits though - No point having any history at all if every commit previous is broken with missing files.

strich on 24 May 2015

@strich I agree with the second statement, I'm a little confused about the first. Are you saying that you are ok with losing the contents AND history of any large files NOT on the tip of your current branch, but the version of the large files that are there, you want them to exist in previous commits up until that version of the file, where before that it will just not exist (be a missing file)?

Example:

If I understand correctly, are you saying that if I have something like this,

| commit | what happened | big contents |
| --- | --- | --- |
| master | nothing new with big files | big1.bin(version 2) |
| master~1 | update big1.bin to version 2, delete big2.bin | big1.bin(version 2) |
| master~2 | add big2.bin | big1.bin(version 1) and big2.bin |
| master~3 | add big1.bin (version 1) | big1.bin (version 1) |

You want to make sure that master and master~1 both point to big1.bin(version 2), but you are ok with big2.bin and big1.bin(version 1) just disappearing?

andyneff on 24 May 2015

Thanks for continuing to reply @andyneff - Yes that is exactly what I'd like to achieve if it is possible.

strich on 24 May 2015

Oh, its possible... My current solution actually preserves ALL the files (so big1.bin(version 1), big1.bin(version 2) AND big2.bin). You you are actually asking for a less, more limited version. It is possible to just use the latest commit if you add a little more

git lfs track "*.npy"
export KEEP_FILE="$(git rev-parse --show-toplevel)/.git/.keep"
git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/" > ${KEEP_FILE}
rm .gitattributes
git checkout .gitattributes  || :

git filter-branch --prune-empty --tree-filter '
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git lfs track "*.npy"
git add .gitattributes .gitconfig

for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
  keep_file=0
  while read keep; do
    if [ "${keep}" == "${file}" ]; then
      keep_file=1
      break
    fi
  done < ${KEEP_FILE}
  if [[ ${keep_file} == 1 ]]; then
    git rm --cached ${file}
    git add ${file}
  else
    git rm -f ${file}
  fi
done' --tag-name-filter cat -- --all

rm ${KEEP_FILE}
rm .git/refs/original -rv
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc

This will remove all files matching the lfs track patterns, (in this case, *.npy) except for those listed in the KEEP_FILE file. The files listed in KEEP_FILE and their history will be maintained.

Of course, you can replace the first 5 lines with anything you want to get the list of files you want to keeps.

Side effect: It is possible that was two separate commit will be merged into one, if the only difference was a file that is now gone. This can also change the topology of your branches too. It will still be "as correct as possible", only some commit messages would disappear. Of course, it is unlikely it will happen, but just an FYI

andyneff on 25 May 2015

Other things I've tried that did NOT work out

I had this idea of using git-replace. This has an advantage of efficiently creating all the lfs pointers once and only once. This WORKS (after a filter-branch to update .gitconfig/.gitattribute), but I have a hard time making the changes permanent. git-filter-branch does not appear to support blob replacement (only commit replacement).
while read line; do
LINE=($line)
OBJECT_HASH=${LINE[0]}
FILENAME="${LINE[@]:1}"
MATCH=$(git check-attr filter ${FILENAME} | grep 'filter: lfs' | sed -r 's/(._): filter: lfs/1/')
if [ ! "${MATCH}" == "" ]; then
POINTER_HASH=$(git show ${OBJECT_HASH} | git lfs clean | git hash-object -w --stdin)
echo New file ${FILENAME} ${OBJECT_HASH} ${POINTER_HASH}
git replace ${OBJECT_HASH} ${POINTER_HASH}
fi
done < <(git rev-list --objects --all | grep "[0-9a-f]_ .")
Solutions involving git rebase --interactive, this is possible, but only works for one branch. Plus it is prone to merge conflicts in cases where your repo uses .gitattributes
env GIT_SEQUENCE_EDITOR="sed -i s:^pick:edit:" git rebase --interactive --root
while : ; do
git lfs ls-files > .gitlfs_converted
for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/1/"); do
new_file=1
while read converted; do
if [ "${converted}" == "${file}" ]; then
new_file=0
fi
done < .gitlfs_converted
if [[ ${new_file} == 1 ]]; then
echo "Processing ${file}"
git rm --cached ${file}
echo "Adding $file"
git add ${file}
fi
done
git commit --no-edit --amend
rm .gitlfs_converted
git rebase --continue || break
done

andyneff on 25 May 2015

I write simple java code which can convert repository for LFS usage: https://github.com/bozaro/git-lfs-migrate
May be it be usefull for somebody.

bozaro on 10 Jun 2015

👍3

Version of the first script that works on OS X with files that contains spaces (but not newline characters):

git filter-branch --prune-empty --tree-filter '
git lfs track "*.zip"
git lfs track "*.exe"
git add .gitattributes

git ls-files -z | xargs -0 git check-attr filter | grep "filter: lfs" | sed -E "s/(.*): filter: lfs/\1/" | tr "\n" "\0" | while read -r -d $'"'\0'"' file; do
    echo "Processing ${file}"

    git rm -f --cached "${file}"
    echo "Adding $file lfs style"
    git add "${file}"
done

' --tag-name-filter cat -- --all

The most unusual part is while read -r -d $'"'\0'"'

The parameter to read, -d, is $'\0', but to escape the single quote inside the block that is already single quoted, we end the quote, open a new double quote, then use single quotes, close the double quote, and then open a single quote back up. Shell escaping wasn't always the easiest..

vmrob on 12 Aug 2015

👍5

As @tlbtlbtlb mentioned, the BFG is much faster than git filter-branch for rewriting history - and I've added explicit Git LFS support with BFG v1.12.5:

$ bfg --convert-to-git-lfs '*.{exe,dll}' --no-blob-protection

Incidentally, the git-lfs-migrate code by @bozaro is quite interesting - it looks like it does an equivalent job, and maybe at equivalent speed - will check it out when I get a chance.

rtyley on 2 Oct 2015

👍21 ❤6

@rtyley

I think both of these projects need to work roughly the same speed.
By the way, I'll stand in the common part of my projects (git-as-svn and git-lfs-migrate) in a separate git-lfs-java project.

bozaro on 2 Oct 2015

@andyneff did you keep your original migration script around? I'd be _very_ interested :)

I'm asking as I'm in the same situation you had: I have a whole lot of binary files in a submodule, and I'd like to merge the submodule to the main repository while converting the files inside to LFS (and keep the whole history of course). The script you posted doesn't seem to deal with this, as this wasn't requested in this issue (I admit I didn't test it yet).

ltrzesniewski on 13 Oct 2015

@ltrzesniewski When I said "I maintained my submodule history" what I meant I kept the submodules hosted, just abandoned using them for future commits. This means if I went back in history before the conversion, it would checkout the submodule and use it. This is very clunky and not really a great idea...

I believe the new preferred way as @rtyley pointed out, is using bfg to convert a repo to lfs, now that it has lfs support.

So I believe what you could do is

Convert each of your your submodule repos to lfs using bfg (sorry, I haven't tried bfg yet)
You can actually merge in the submodule's history into you main repo history using git subtree I think. I've only ever done this once, and if I remember correctly it simply connects the two histories at the current commit

  /   \
M~1    S
 |     |
M~2   S~1
 |     |
M~3   S~2
...

Where M are your main repo commits, and S are submodule commits. This means that the versions of the submodule the M~1... commits point to won't be lined up with the new S* commits. In fact M~1.. will still point to the submodules. Neither solution is ideal.

To summarize, I only know of two tricks

Just remove the submodule in your current commit, and then add all the files back tracking git lfs files. This way when you checkout M~1 ... commits, it'll re-checkout the submodule version. You'll have to re-init the submodules and submodule update each time, and it'll probably break requiring you to manually help submodule each time, but it is doable. This also requires you to continue hosting the submodule repos
Use subtree to merge the history in. and just remember that the versions M~1... will be non-functional.
The PERFECT solution would probably be something that could line up each sha from the submodules and basically shuffle the commits together. Perhaps a smart filter script could do this, but I don't have one. What I'd imagine is

  /   
M~1
 |   \
M~2    S
 |   \
M~3   S~1
 |     |
 |    S~2
 |
M~4
 |  \
 |    S~3  
...

The commit from M~1 pointed to S,
The commit from M~2 pointed to S~1, so it gets both S~1 and S~2 because M~3 pointed to S~2
The commit from M~3 and M~4 pointed to S~3
etc...

I see references to another merge method I'm unfamiliar with, maybe it can help you.

http://stackoverflow.com/a/8901691/4166604
http://x3ro.de/2013/09/01/Integrating-a-submodule-into-the-parent-repository.html

andyneff on 13 Oct 2015

@andyneff thanks for your help, I appreciate it very much!

Now I understand what you meant in your first post, I guess I just got prone to wishful thinking but you cleared up the confusion - I basically thought you've already got a solution for that _PERFECT_ method you describe :wink:

In my case I don't really need to keep the commit history of the submodule, but I need to keep track of the relevant file versions referenced by the main repo, so I was thinking about performing a submodule checkout for each tree in the --tree-filter script but that would be _sloooooow_. I'll see what I can do from there.

ltrzesniewski on 13 Oct 2015

@andyneff
If you fetch both repository to some .git directory or intergrate submodule before convert, then bozaro/git-lfs-migrate will convert whole both repository history, including submodule links.

bozaro on 14 Oct 2015

I am in a slightly different situation where I have an orphan branch called design in a repo where we store sketch files and other assets. I am ok with losing the history since the project is less than a month old, and I already tried to do so but it doesn't seem to work.

Couple of questions:

Can I git track * ?
Do I have to git lfs init? If yes when? Everyone in the team needs to to that as well?
How do I check it's working? It appears it is not, it says compressing objects, writing objects as usual.

kilianc on 22 Oct 2015

@kilianc

git lfs init

This command need to run only one time on user computer. Usually it's runned by git-lfs installer.

This command add to $HOME/.gitconfig lines like:

[filter "lfs"]
    clean = git-lfs clean %f
    smudge = git-lfs smudge %f
    required = true

bozaro on 23 Oct 2015

I hope I'm not beating a dead horse, but..

This little script is still probably slower than bfg, but I couldn't figure out how to get bfg to honor my lfs remote location. So, I wanted to build on the work from @andyneff and @vmrob and make the filter-branch commands they provided faster.

git filter-branch --prune-empty --tree-filter '
git config -f .gitconfig lfs.url "http://artifactory.local:8081/artifactory/api/lfs/git-lfs"
git lfs track "*.exe" "*.gz" "*.msi" "*.pdf" "*.ppt" "*.pptx" "*.rar" "*.vdx" "*.vsd" "*.war" "*.xls" "*.xlsm" "*.xlsx" "*.zip" > /dev/null
git add .gitattributes .gitconfig
git ls-files | xargs -d "\n" git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/" | xargs -d "\n" -r -n 50 bash -c "git rm -f --cached \"\$@\"; git add \"\$@\"" bash \
' --tag-name-filter cat -- --all

By combining the "git lfs track" line into one and by using "xargs -n 50" I was able to cut down on invoking git by more than 50 times per revision, in my case. (Way too many binaries in our repository!) That made things FAR faster... It handles spaces in the filenames also.

It seems to be working on Linux, but I can't comment on whether it would work for Mac OS X.

kalibyrn on 27 Nov 2015

@kalibyrn
I'm sure, that git-filter-branch is really bad idea. It's convert revision-by-revision with full checkout every revision.
Converting bare repo is much faster. I would recommend for you tool: https://github.com/bozaro/git-lfs-migrate

bozaro on 27 Nov 2015

git-lfs-migrate is amazing! Have converted a few repos, and done some superficial verification that the converted tags and HEADs are good.

Super easy to use and worked a treat. Thanks @bozaro !

jamesblackburn on 7 Jul 2016

👍1

Does git-lfs-migrate change commit hash numbers? I would like to migrate a repo with large files, but am afraid of using filter-branch. If all blobs are substituted by a pointer (text file) in the history, without changing the actual graph that would be perfect.

dashesy on 21 Jul 2016

👍1

Because the file object blobs do change (from a large file, to a text pointer) it by the design of git does change the blob SHA, and therefore the commit SHA. The result is there isn't a way, in git, of changing the content of blobs without changing the SHAs.

jamesblackburn on 21 Jul 2016

👍6

@jamesblackburn But git could add a feature (or plugin) to fake the SHA for some _special_ blobs (blobs that have their SHA hard-coded). Problem with changing the commit SHA is that you suddenly lose all of the references to old commits (issue trackers, wikis, urls .. become invalid).

dashesy on 21 Jul 2016

@dashesy

Faking the sha would be VERY bad (if it were even possible). The SHAs are different, they NEED to be pulled down. If they looked the same, other people fetching the latest version wouldn't know they need the new SHAs

As for your issue tracking, etc... problem... Yes, those would be broken. There is a git replace and git graphs feature... I'm not sure if those could help, and I don't think git lfs migrate uses those. It might be possible to keep a list of all the old shas replaced by new shas with that feature. However, I'm not sure what you would do with that list when you are done...

You are justified so be worried about all the SHAs changing, but this is necessary.

The entire graph (at least for the branch you convert over) still retain its original topology, only ALL the SHAs in the graph will change (as of the first lfs file, at least). I don't remember if git-lfs-migrate convert all your branches or not.

So a few points

Yes, you'll be pushing a whole new set of SHAs
Yes, you will be LOSING the original SHAs, but the topology and commit history will remain. The only difference will be large files will be lfs pointers in the git repository instead of large files. This means your repo will be smaller.
Yes, all the other team members will need to pull down the new changes, and make sure they are branched off of the new SHAs, and not the old one. This means they can't use git merge (unless you use --squash) to get to the new SHAs, it either has to be git rebase, git reset, git cherry-pick, or hopefully git checkout, all depending on the situation ;)
Depending on who all is using your repo, complexity, etc... you may want to go the route of having everyone just git fetch and then checkout the new SHAs, or just tell everyone to make a clean clone (and possible change the repo name just to try and prevent confusion... Of course that sometimes add confusion too)
git-lfs-migrate is the preferred way now. You shouldn't be needing filter-branch anymore... I hope

andyneff on 21 Jul 2016

git-lfs-migrate has an option now for mapping old commit hashes to new ones. I'm linking to the pull request for adding --annotate-id-prefix since I installed from a non-master branch, but it may be in master at some point soon: https://github.com/bozaro/git-lfs-migrate/pull/24

Permafacture on 30 Dec 2016

👍1

@Permafacture Any idea if this option exists in this project's migrate tool?

revolter on 18 Sep 2017

Isn't git lfs migrate the best option for this?

revolter on 18 Sep 2017

👍3

@revolter it didn't exist when this thread was created.

FWIW I tried to use it to convert a large repo just after it was released, but with no luck. @bozaro's tool worked just fine, although I had to revert the "Dramatically reduce memory usage" commit (so it used lots of memory but it actually finished before the heat death of the universe 😉). Maybe now the issues are ironed out, I don't know.

ltrzesniewski on 18 Sep 2017

👍1

LFS v2.3.0 improves migrate perf and fixes some crucial bugs. So, hopefully it works better now :)

technoweenie on 18 Sep 2017

👍3

Haven't looked into it, but I don't see any documentation saying this
feature (updating all the old commit messages to include the original hash)
exists with that tool.

Permafacture on 19 Sep 2017

Was this page helpful?

0 / 5 - 0 ratings