Gitea Version: 1.0.0+76-geb9ce39
$ git --version
git version 2.7.4
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
When merging a PR, Gitea clones the whole repository, which is unacceptably time-consuming when the repository is large.
See https://github.com/go-gitea/gitea/blob/master/models/pull.go#L285
Just changing the clone path to `owner/repo.{{.Time}}.git` instead would speed it up considerably, since a `git clone` on the same mount point just hardlinks the objects; slapping on `--local` could speed it up further in some cases.
Further improvements include `--single-branch` or `--depth 1` to only fetch the branch you're merging into.
`--local`, `-l`
When the repository to clone from is on a local machine, this flag bypasses the normal "Git aware" transport mechanism and clones the repository by making a copy of HEAD and everything under the objects and refs directories. The files under the .git/objects/ directory are hardlinked to save space when possible.

`--no-hardlinks`
Force the cloning process from a repository on a local filesystem to copy the files under the .git/objects directory instead of using hardlinks. This may be desirable if you are trying to make a back-up of your repository.
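A hedged sketch of what such a staging clone could look like, using the flags above (`$OWNER`, `$REPO`, `$TIME`, and `$BASE_BRANCH` are hypothetical stand-ins for Gitea's template values):

```sh
# Hypothetical staging-clone command: same mount point so objects are
# hardlinked rather than copied, and only the branch being merged into.
# Note: --depth is ignored for local path clones (git warns and suggests
# a file:// URL instead), so it is omitted here.
git clone --local --single-branch --branch "$BASE_BRANCH" \
    "/repos/$OWNER/$REPO.git" "/repos/$OWNER/$REPO.$TIME.git"
```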
http://stackoverflow.com/questions/10637584/how-to-do-a-non-fast-forward-git-merge-to-a-branch-that-isnt-checked-out might be useful as well.
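As an aside, much newer git (2.38 and later) gained `git merge-tree --write-tree`, which computes exactly such a non-fast-forward merge without any working tree. A minimal sketch, branch names hypothetical:

```sh
# Compute the merge result as a tree object - no checkout involved.
# merge-tree exits non-zero and reports the conflicts if the merge
# cannot be done cleanly.
TREE=$(git merge-tree --write-tree "$BASE_BRANCH" "$PR_BRANCH")
COMMIT=$(git commit-tree "$TREE" \
    -p "$(git rev-parse "$BASE_BRANCH")" \
    -p "$(git rev-parse "$PR_BRANCH")" \
    -m "Merge $PR_BRANCH into $BASE_BRANCH")
# Advance the base branch to the new merge commit.
git update-ref "refs/heads/$BASE_BRANCH" "$COMMIT"
```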
Edit: I have a patch experimenting with that, will submit it later.
@typeless not sure if it works for bare repos though 🙁
For reference: https://gist.github.com/cool-RR/6575042
In my case it is a little slow too; the repo is very big. One solution would be to add a gauge/loading image as feedback for the user. Is that possible?
If yes, it would be good to add it when applying a merge of a pull request too, which is also very slow and gives no feedback to the user, only the loading indicator in the browser title.
My repo size is 1.8 GB.
If we could refactor so that there is no local clone operation on the server when merging, I think it would significantly decrease merging time. Or at least keep the clone permanently and just reset it hard from origin when needed; in that case all of its operations must be protected by locking.
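A rough sketch of that permanent-clone idea, with `flock(1)` standing in for the locking (all paths and names hypothetical):

```sh
# One long-lived staging clone per repository, reset hard from origin
# before each merge; flock(1) serializes merges on the same repository.
(
    flock 9
    if [ ! -d "$STAGING" ]; then
        git clone "$BASE_REPO" "$STAGING"
    fi
    git -C "$STAGING" fetch origin "$BASE_BRANCH"
    git -C "$STAGING" checkout -B "$BASE_BRANCH" FETCH_HEAD
    git -C "$STAGING" reset --hard FETCH_HEAD
    git -C "$STAGING" clean -fdx   # drop leftovers from any failed merge
    # ... perform the merge in $STAGING and push the result back ...
) 9>"$STAGING.lock"
```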
I once attempted to fix the issue by combining some plumbing commands but that approach had some critical flaws. It seems that the better solution would be to fix it in the Git upstream.
I don't like that at present we're basically forced to create arbitrary paths on the server - it smells dangerous to me, even though we should have protected against climbing out of the root. I wouldn't be surprised if there were genuine issues on Windows regarding this e.g. reserved filenames, filename length, reserved characters and the like. I could also imagine a filename encoding problem to affect linux at some point. These problems similarly affect uploads.
I'm just gonna put some details about how git works down here so they don't disappear.
How does `git commit` work?

- `git add` and the like stage the changes first.
- `git write-tree`, with `GIT_INDEX_FILE` set appropriately, to write the index as a tree object to the git object db, saving the stdout'd `$tree_id` for the next step.
- `git commit-tree $tree_id`, with the parent ids prepended with `-p` and the commit message through stdin; outputs the `$commit_id`.
- `git update-ref HEAD $commit_id $CURRENT_HEAD`, to ensure that when we change the HEAD it's actually not been changed under our feet.
- `git gc --auto`, then deal with post-commit and some info.

How does `git add "$FILE"` work?

- `git hash-object -w "$FILE"`, outputting the `$BLOB_ID` (or `cat "$FILE_DATA" | git hash-object -w --path="$FILE_PATH" --stdin`).
- `git update-index --add --info-only "$FILE"`, or `git update-index --add --cacheinfo $MODE,$BLOB_ID,$FILE_PATH` - meaning you don't need the file to be at the supposed path.
- (`git read-tree HEAD`, or `git ls-tree -r $TREE | git update-index --index-info`, to make the index `$TREE` before doing anything - i.e. to set the index to map HEAD - otherwise the tree you commit will only contain what you've added.)

`git rm` is similar, and there's also `--replace` in `git update-index` to force changes.
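Strung together, the steps above amount to something like this hedged sketch (`$FILE`, `$FILE_PATH`, and the commit message are placeholders):

```sh
#!/bin/sh
set -e

export GIT_INDEX_FILE=$(mktemp)        # private index; the real one is untouched
CURRENT_HEAD=$(git rev-parse HEAD)

git read-tree HEAD                     # seed the index from HEAD first

BLOB_ID=$(git hash-object -w "$FILE")  # write the blob into the object db
git update-index --add --cacheinfo "100644,$BLOB_ID,$FILE_PATH"

TREE_ID=$(git write-tree)              # write the index as a tree object
COMMIT_ID=$(echo "commit message" | git commit-tree "$TREE_ID" -p "$CURRENT_HEAD")

# Refuses to move HEAD if it has been changed under our feet.
git update-ref HEAD "$COMMIT_ID" "$CURRENT_HEAD"
rm -f "$GIT_INDEX_FILE"
```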
I need to look at how merge actually works - I think we should be able to generate the merged data directly. Using plumbing it should be possible to not do a physical checkout at all and to create an index in a bare repository that shares the object dbs of the repositories with the two HEADs we're merging from.
E.g. we create an index from HEAD_1, then merge each changed file in HEAD_2 into HEAD_1's version in turn - hashing each new object and updating the index as appropriate. We finally write the index as a tree, commit the tree, and update the refs before pushing. The files being merged don't need to be written to a working copy - just to a temporary file. Similarly, the shared repositories can be in a temp directory and the index is a temporary file too. At no point do we actually try to check out a repo, and if a user wants to make their own repo difficult by sticking things in .git/ they can (you can't commit any path with a .git in it from the git CLI).
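A hedged sketch of that checkout-free merge, using a temporary index and a 3-way `git read-tree` - this resolves only trivial merges; conflicting paths are left unmerged in the index and would need the per-file merging described above (`HEAD_1`/`HEAD_2` as in the comment):

```sh
# In a bare repo whose object db can see both heads (e.g. via alternates).
export GIT_INDEX_FILE=$(mktemp)
BASE=$(git merge-base HEAD_1 HEAD_2)

git read-tree -im "$BASE" HEAD_1 HEAD_2   # 3-way merge into the index only

# write-tree fails while unmerged entries remain; those files would be
# merged one by one (hash-object -w + update-index) as described above.
TREE=$(git write-tree)
COMMIT=$(echo "Merge HEAD_2 into HEAD_1" | git commit-tree "$TREE" -p HEAD_1 -p HEAD_2)
git update-ref refs/heads/target "$COMMIT"   # 'target' is a hypothetical ref
rm -f "$GIT_INDEX_FILE"
```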
Finally got round to looking at the source code for git: `verify_path`, which is run on adding to the index, disallows the committing of any .git directories (https://github.com/git/git/blob/7f4e64169352e03476b0ea64e7e2973669e491a2/read-cache.c#L951). Although I haven't checked completely, I suspect a malicious repo that contained a tree with a ".git" directory would be similarly protected by verify_path.
@zeripath Btw, I had a failed attempt along this line of thinking: https://github.com/go-gitea/gitea/pull/641.
Maybe it's actually feasible given sufficient knowledge of the git plumbing and internals.
I think your current approach should reduce the load considerably. Certainly the use of shared object databases is extremely important and would be responsible for improvements in load. It would be worth a PR in itself. In fact, if you don't think #4921 is going to be ready in time for 1.7, I'd recommend doing exactly that. (It's worth noting that most repos have many more objects than checkout files, and although hard links are relatively cheap, they still need to be created and still change inodes.)
Just thinking on, although you've shared the base repository object database, when you add the remote that doesn't get shared too. It is possible to set up a repo to share multiple object databases. Could it be that one of these repos could have a much smaller object db, so when you fetch on the "remote" you actually copy a lot of objects unintentionally? This is likely to be an extreme edge case though.
Although libgit2 would likely provide an improvement, it still requires an in-depth understanding of the plumbing, and most of GitHub's improvements come from that deeper knowledge of the plumbing. Yes, they saw benefits de-scriptifying in the way that libgit2 allows them to do - but we still need to move off the porcelain and gain a deeper understanding of the plumbing before we can take that approach. Your current PR is a definite step towards deeper plumbing knowledge. I'm working on file upload at present and have a working - albeit not very pretty - solution for this.
It's also worth noting that the unique thing that libgit2 provides is the in memory index - the merging strategies come from this, but as a default index is a file it may be that we don't need this.
> Certainly the use of shared object databases is extremely important and would be responsible for improvements in load.
We deployed that bit on our server a couple of months ago, and it makes the difference between totally unusable (i.e. the merge times out) and slow but usable.
The Gitea performance on Windows with our multi-gigabyte repositories is atrocious out of the box, but we run a forked version that makes it quite bearable. It uses go-git for accessing the repositories and often makes the pages load > 10x faster (in some cases > 30x). I was not brave enough to use go-git for any write operations, especially since the merging algorithm is quite complex, but using the shared object database in itself saved ~30 seconds on each merge on 3+ GB repositories.
@filipnavara why not share your changes back with us? I have also started to move parts of the read operations to go-git.
@lafriks I shared the links in other issues/PRs. https://github.com/filipnavara/gitea/commits/commitgraph (also the perf-read branch). I will rebase it when I get back after new year.
Nice, will take a look again. Did the perf improvements get merged into go-git?
Yep, go-git has merged my performance fixes.
The branch with serialized commit tree is stale at the moment, but it is not that important anyway. We use it on our server, but the performance improvements are relatively small compared to the rest of the changes.
> Just thinking on, although you've shared the base repository object database, when you add the remote that doesn't get shared too. It is possible to set up a repo to share multiple object databases. Could it be that one of these repos could have a much smaller object db, so when you fetch on the "remote" you actually copy a lot of objects unintentionally? This is likely to be an extreme edge case though.
@zeripath Adding the path of the head repository to `.git/objects/info/alternates` of the staging repository can probably address the problem of redundant `git fetch`.
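A minimal sketch of that, assuming `$STAGING` is the bare staging repository and `$HEAD_REPO` is the head (fork) repository:

```sh
# Let the staging repo resolve objects straight out of the head repo's
# object db, so the subsequent fetch transfers (almost) nothing.
echo "$HEAD_REPO/objects" >> "$STAGING/objects/info/alternates"
git -C "$STAGING" fetch "$HEAD_REPO" "$PR_BRANCH"
```

(`git clone --shared` writes the same alternates entry automatically, but only for the repository being cloned from; additional object databases have to be appended by hand as above.)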
there is a bounty on this issue: https://www.bountysource.com/issues/40674560-merging-prs-takes-very-long-time-for-large-repository
It's closed. It's from 2017. 😄
It can be claimed ... that's what I'd like to remind people of.