I'd like to give Git LFS a good crack with our existing ~25GB Git repo but I'm at a bit of a loss as to how we should go about migrating it to support LFS.
I'm happy to lose all history of the existing large files I'd like to track with LFS.
Do I need to write a script to find and purge all the filetypes I want to track with LFS from the existing Git history, untrack them, then retrack them with LFS?
I'm HOPING there is a better way to do this... BUT after _redacted_ units of time, I found a way where you will NOT lose any history of existing large files. It is probably SLOW but it works. DISCLAIMER: I'm just a stubborn user, can't promise this will work out for you. BACKUP!
I have run into a similar situation myself, but in my case, all my large files were sitting in submodules. So I was able to remove the submodules, track the files, and add them back (without the submodules, HOORAY!) This way I can say I did maintain all my history, only some of the history is in the submodules. I'm guessing this is not similar to your situation, and that everything is in one big repo.
1) Filter-branch to convert all tracking over to lfs
git filter-branch --prune-empty --tree-filter '
  git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
  git lfs track "*.npy"
  git add .gitattributes .gitconfig
  for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
    echo "Processing ${file}"
    git rm -f --cached ${file}
    echo "Adding ${file} lfs style"
    git add ${file}
  done' --tag-name-filter cat -- --all
The git config lines are there so that if you have an lfs server in a separate location from the git repo, everything works. IF you have them at the same url, you can skip that step.

2) Push the changes to whatever remote or remotes you have. Of course you have to use the -f option, which can (and WILL!) have implications for all the other users of the repo. Make sure no one else pushes or references the commits you just rebased, or else you will have a mess. This is when all the large files are sent to the lfs server.
git push -f origin master
(Optional) Collect garbage to shrink currently checked out repo
rm .git/refs/original -rf
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc
(Optional) Collect garbage on a bare repo (on the remotes)
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc
I hope this helps!
Untested ideas
Instead of Step 1, just add
git lfs track "*.foo"
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git add .gitattributes .gitconfig
inside the for loop in step two. It should save that wasted initial rebase and prevent the need for the git add .
Tested using git-lfs 0.5.1 and git 1.9.4.msysgit.1 (Yes, on 64-bit Windows)
bfg is much more efficient than git-filter-branch, especially for long histories.
https://rtyley.github.io/bfg-repo-cleaner/
Here's roughly what I did to convert all my .mov files to lfs:
In a fresh directory:
Back to my src directory:
Kind of a chore to figure out, but now my repo is small and zippy.
Tip: do this on a cloud machine instead of your laptop, since most of the time is pulling/pushing data to github.
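The command blocks for those two steps didn't survive above; here's a rough reconstruction of that workflow, assuming bfg's --delete-files mode (the --convert-to-git-lfs option mentioned later didn't exist yet) and a hypothetical git@github.com:user/repo.git remote:

# In a fresh directory: strip old .mov blobs from history via a mirror clone
git clone --mirror git@github.com:user/repo.git
bfg --delete-files '*.mov' repo.git
cd repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

# Back in the src directory: sync to the rewritten history, then re-add the
# surviving .mov files (BFG protects HEAD by default) as LFS pointers
cd ../src
git fetch origin && git reset --hard origin/master
git lfs track '*.mov'
git rm --cached '*.mov'
git add .gitattributes '*.mov'
git commit -m "Track *.mov with Git LFS"
git push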
@tlbtlbtlb in your solution, when you check out all the previous versions of your history, are all the .mov files correctly there, or are they missing everywhere except the newest commits?
They're gone from previous versions. Which is what's necessary to cut the size of the repository.
I guess I was being stubborn and trying to give anyone an option where they DON'T lose the history of all large files, even though it was stated he'd be happy without it.
I guess the slower built-in equivalent to bfg --delete-files '*.mov'
would be
git filter-branch --prune-empty --index-filter 'git rm --ignore-unmatch --cached "*.mov"'
Although I'm happy to lose the history of the large files, I would need them to continue to exist in previous commits. There's no point having any history at all if every previous commit is broken with missing files.
@strich I agree with the second statement; I'm a little confused about the first. Are you saying that you are ok with losing the contents AND history of any large files NOT at the tip of your current branch, but for the versions of the large files that ARE there, you want them to exist in previous commits back to where that version was introduced, before which the file would simply not exist (be missing)?
Example:
If I understand correctly, are you saying that if I have something like this,
| commit | what happened | big contents |
| --- | --- | --- |
| master | nothing new with big files | big1.bin(version 2) |
| master~1 | update big1.bin to version 2, delete big2.bin | big1.bin(version 2) |
| master~2 | add big2.bin | big1.bin(version 1) and big2.bin |
| master~3 | add big1.bin (version 1) | big1.bin (version 1) |
You want to make sure that master and master~1 both point to big1.bin(version 2), but you are ok with big2.bin and big1.bin(version 1) just disappearing?
Thanks for continuing to reply @andyneff - Yes that is exactly what I'd like to achieve if it is possible.
Oh, it's possible... My current solution actually preserves ALL the files (so big1.bin(version 1), big1.bin(version 2) AND big2.bin). You are actually asking for a lesser, more limited version. It is possible to keep just the latest files if you add a little more:
git lfs track "*.npy"
export KEEP_FILE="$(git rev-parse --show-toplevel)/.git/.keep"
git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/" > ${KEEP_FILE}
rm .gitattributes
git checkout .gitattributes || :
git filter-branch --prune-empty --tree-filter '
  git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
  git lfs track "*.npy"
  git add .gitattributes .gitconfig
  for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
    keep_file=0
    while read keep; do
      if [ "${keep}" == "${file}" ]; then
        keep_file=1
        break
      fi
    done < ${KEEP_FILE}
    if [[ ${keep_file} == 1 ]]; then
      git rm --cached ${file}
      git add ${file}
    else
      git rm -f ${file}
    fi
  done' --tag-name-filter cat -- --all
rm ${KEEP_FILE}
rm .git/refs/original -rv
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc
This will remove all files matching the lfs track patterns (in this case, *.npy) except for those listed in the KEEP_FILE file. The files listed in KEEP_FILE, and their history, will be maintained.
Of course, you can replace the first 5 lines with anything you want to produce the list of files you want to keep.
Side effect: it is possible that two separate commits will be merged into one, if the only difference between them was a file that is now gone. This can also change the topology of your branches. It will still be "as correct as possible"; only some commit messages would disappear. Of course, it is unlikely to happen, but just an FYI.
Other things I've tried that did NOT work out
git rebase --interactive: this is possible, but it only works for one branch. Plus it is prone to merge conflicts in cases where your repo uses .gitattributes.

I wrote a simple Java tool which can convert a repository for LFS usage: https://github.com/bozaro/git-lfs-migrate

Maybe it will be useful for somebody.
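For reference, the README there shows an invocation along these lines (quoted from memory, so treat the exact flags as an assumption and check --help; the repo names, URL, and patterns are placeholders):

# Convert a bare repo, rewriting matching files to LFS pointers:
java -jar git-lfs-migrate.jar \
     -s source-repo.git \
     -d converted-repo.git \
     -g git@github.com:user/repo.git \
     "*.npy" "*.zip"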
Version of the first script that works on OS X with files that contain spaces (but not newline characters):
git filter-branch --prune-empty --tree-filter '
  git lfs track "*.zip"
  git lfs track "*.exe"
  git add .gitattributes
  git ls-files -z | xargs -0 git check-attr filter | grep "filter: lfs" | sed -E "s/(.*): filter: lfs/\1/" | tr "\n" "\0" | while read -r -d $'"'\0'"' file; do
    echo "Processing ${file}"
    git rm -f --cached "${file}"
    echo "Adding ${file} lfs style"
    git add "${file}"
  done
' --tag-name-filter cat -- --all
The most unusual part is while read -r -d $'"'\0'"'. The parameter to read -d is $'\0', but to escape the single quote inside a block that is already single-quoted, we end the quote, open a new double quote, put the single quotes inside it, close the double quote, and then open a single quote back up. Shell escaping isn't always the easiest...
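If that splicing is hard to follow, here is the same trick in isolation (plain bash, nothing LFS-specific):

# Inside a single-quoted script you cannot write a literal single quote,
# so you close the quote, emit one inside double quotes, and reopen:
echo 'outer '"'"'inner'"'"' outer'    # prints: outer 'inner' outer
# In the filter above, $'"'\0'"' therefore reaches the inner shell as
# $'\0' (an ANSI-C quoted NUL), which read -d uses as its delimiter.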
As @tlbtlbtlb mentioned, the BFG is much faster than git filter-branch
for rewriting history - and I've added explicit Git LFS support with BFG v1.12.5:
$ bfg --convert-to-git-lfs '*.{exe,dll}' --no-blob-protection
Incidentally, the git-lfs-migrate
code by @bozaro is quite interesting - it looks like it does an equivalent job, and maybe at equivalent speed - will check it out when I get a chance.
@rtyley I think both of these projects should work at roughly the same speed.

By the way, I'll move the common part of my projects (git-as-svn and git-lfs-migrate) into a separate git-lfs-java project.
@andyneff did you keep your original migration script around? I'd be _very_ interested :)
I'm asking as I'm in the same situation you had: I have a whole lot of binary files in a submodule, and I'd like to merge the submodule into the main repository while converting the files inside to LFS (and keep the whole history, of course). The script you posted doesn't seem to deal with this, as it wasn't requested in this issue (I admit I haven't tested it yet).
@ltrzesniewski When I said "I maintained my submodule history", what I meant was that I kept the submodules hosted, just abandoned using them for future commits. This means if I went back in history to before the conversion, it would check out the submodule and use it. This is very clunky and not really a great idea...
I believe the new preferred way as @rtyley pointed out, is using bfg to convert a repo to lfs, now that it has lfs support.
So I believe what you could do is
/ \
M~1 S
| |
M~2 S~1
| |
M~3 S~2
...
Where M are your main repo commits, and S are submodule commits. This means that the versions of the submodule the M~1... commits point to won't be lined up with the new S* commits. In fact M~1.. will still point to the submodules. Neither solution is ideal.
To summarize, I only know of two tricks
/
M~1
| \
M~2 S
| \
M~3 S~1
| |
| S~2
|
M~4
| \
| S~3
...
I see references to another merge method I'm unfamiliar with, maybe it can help you.
http://stackoverflow.com/a/8901691/4166604
http://x3ro.de/2013/09/01/Integrating-a-submodule-into-the-parent-repository.html
@andyneff thanks for your help, I appreciate it very much!
Now I understand what you meant in your first post. I guess I was just prone to wishful thinking, but you cleared up the confusion - I basically thought you already had a solution for that _PERFECT_ method you describe :wink:
In my case I don't really need to keep the commit history of the submodule, but I need to keep track of the relevant file versions referenced by the main repo, so I was thinking about performing a submodule checkout for each tree in the --tree-filter
script but that would be _sloooooow_. I'll see what I can do from there.
@andyneff
If you fetch both repositories into one .git directory, or integrate the submodule before converting, then bozaro/git-lfs-migrate will convert the whole history of both repositories, including the submodule links.
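A hedged sketch of that idea (the paths and refspec here are hypothetical):

# Make the submodule's objects and refs reachable from the main repo,
# so a whole-history converter can resolve the gitlink entries:
cd main-repo
git fetch /path/to/submodule-repo "refs/heads/*:refs/submodule/heads/*"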
I am in a slightly different situation where I have an orphan branch called design in a repo where we store sketch files and other assets. I am ok with losing the history since the project is less than a month old, and I already tried to do so but it doesn't seem to work.
Couple of questions:

- Do I need to run git lfs track *?
- Do I need to run git lfs init? If yes, when? Does everyone in the team need to do that as well?

@kilianc git lfs init needs to be run only one time on each user's computer. Usually it's run by the git-lfs installer. This command adds lines like these to $HOME/.gitconfig:
[filter "lfs"]
    clean = git-lfs clean %f
    smudge = git-lfs smudge %f
    required = true
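You can verify the filters were registered with a standard config query (nothing LFS-specific about it):

git config --global --get-regexp "^filter\.lfs"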
I hope I'm not beating a dead horse, but..
This little script is still probably slower than bfg, but I couldn't figure out how to get bfg to honor my lfs remote location. So, I wanted to build on the work from @andyneff and @vmrob and make the filter-branch commands they provided faster.
git filter-branch --prune-empty --tree-filter '
  git config -f .gitconfig lfs.url "http://artifactory.local:8081/artifactory/api/lfs/git-lfs"
  git lfs track "*.exe" "*.gz" "*.msi" "*.pdf" "*.ppt" "*.pptx" "*.rar" "*.vdx" "*.vsd" "*.war" "*.xls" "*.xlsm" "*.xlsx" "*.zip" > /dev/null
  git add .gitattributes .gitconfig
  git ls-files | xargs -d "\n" git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/" | xargs -d "\n" -r -n 50 bash -c "git rm -f --cached \"\$@\"; git add \"\$@\"" bash \
' --tag-name-filter cat -- --all
By combining the "git lfs track" calls into a single line and by using "xargs -n 50", I was able to cut down the number of git invocations by more than 50 per revision, in my case. (Way too many binaries in our repository!) That made things FAR faster... It handles spaces in the filenames also.
It seems to be working on Linux, but I can't comment on whether it would work for Mac OS X.
@kalibyrn
I'm sure that git-filter-branch is a really bad idea. It converts revision by revision, with a full checkout of every revision.
Converting a bare repo is much faster. I would recommend this tool: https://github.com/bozaro/git-lfs-migrate
git-lfs-migrate
is amazing! Have converted a few repos, and done some superficial verification that the converted tags and HEADs are good.
Super easy to use and worked a treat. Thanks @bozaro !
Does git-lfs-migrate change commit hashes? I would like to migrate a repo with large files, but am afraid of using filter-branch. If all blobs were substituted by a pointer (text file) in the history, without changing the actual graph, that would be perfect.
Because the file blobs do change (from a large file to a text pointer), by the design of git the blob SHAs change, and therefore the commit SHAs too. The result is that there isn't a way, in git, to change the content of blobs without changing the SHAs.
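For illustration, a Git LFS pointer blob is just a tiny text file like this (the oid and size below are made up):

version https://git-lfs.github.com/spec/v1
oid sha256:4665a5ea423c2713d436b5ee50593a9640e0018c1550b5a0002f74190d6caea8
size 52428800

Since the blob now stores this text instead of the original bytes, its SHA changes, and that change propagates up through every tree and commit that references it.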
@jamesblackburn But git could add a feature (or plugin) to fake the SHA for some _special_ blobs (blobs that have their SHA hard-coded). The problem with changing the commit SHAs is that you suddenly lose all of the references to old commits (issue trackers, wikis, urls... become invalid).
@dashesy
Faking the SHA would be VERY bad (if it were even possible). The SHAs are different; they NEED to be pulled down. If they looked the same, other people fetching the latest version wouldn't know they need the new SHAs.
As for your issue-tracking, etc. problem... Yes, those would be broken. There are the git replace and git grafts features... I'm not sure if those could help, and I don't think git lfs migrate uses those. It might be possible to keep a list of all the old SHAs replaced by new SHAs with that feature. However, I'm not sure what you would do with that list when you are done...
You are justified to be worried about all the SHAs changing, but this is necessary.
The entire graph (at least for the branch you convert over) still retains its original topology; only the SHAs in the graph will change (from the first lfs file onward, at least). I don't remember if git-lfs-migrate converts all your branches or not.
So a few points:

- You can't use git merge (unless you use --squash) to get to the new SHAs; it has to be git rebase, git reset, git cherry-pick, or hopefully git checkout, all depending on the situation ;)
- Everyone will have to git fetch and then check out the new SHAs (see the sketch below), or just make a clean clone (and possibly change the repo name just to try and prevent confusion... of course that sometimes adds confusion too)
- git-lfs-migrate is the preferred way now. You shouldn't be needing filter-branch anymore... I hope
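For the second point, a minimal sequence a collaborator could run (this discards any local work, so back up first):

# Hard-sync an existing clone onto the rewritten history:
git fetch origin
git checkout master
git reset --hard origin/master
git gc --prune=now    # optional: drop the now-unreachable old commits locally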
git-lfs-migrate has an option now for mapping old commit hashes to new ones. I'm linking to the pull request for adding --annotate-id-prefix since I installed from a non-master branch, but it may be in master at some point soon: https://github.com/bozaro/git-lfs-migrate/pull/24
@Permafacture Any idea if this option exists in this project's migrate
tool?
Isn't git lfs migrate
the best option for this?
@revolter it didn't exist when this thread was created.
FWIW I tried to use it to convert a large repo just after it was released, but with no luck. @bozaro's tool worked just fine, although I had to revert the "Dramatically reduce memory usage" commit (so it used lots of memory but it actually finished before the heat death of the universe :wink:). Maybe now the issues are ironed out, I don't know.
LFS v2.3.0 improves migrate
perf and fixes some crucial bugs. So, hopefully it works better now :)
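For anyone landing here now, the built-in command covers the simple case (flags per the current git-lfs documentation; adjust the patterns to your repo):

# Rewrite every local branch and tag, converting matching files to LFS:
git lfs migrate import --everything --include="*.mov,*.zip"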
Haven't looked into it, but I don't see any documentation saying this
feature (updating all the old commit messages to include the original hash)
exists with that tool.