Gitea: Merging PRs takes very long time for large repository

Created on 6 Jan 2017 · 19 comments · Source: go-gitea/gitea

  • Gitea version (or commit ref):

Gitea Version: 1.0.0+76-geb9ce39

  • Git version:
$ git --version
git version 2.7.4
  • Operating system:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
  • Database (use [x]):

    • [ ] PostgreSQL

    • [ ] MySQL

    • [x] SQLite

  • Can you reproduce the bug at https://try.gitea.io:

    • [ ] Yes (provide example URL)

    • [x] No, https://try.gitea.io will report an out-of-memory error when pushing a large repository.

    • [ ] Not relevant

Description


When merging a PR, Gitea will clone the whole repository, which is unacceptably time-consuming when the repository is large.

See https://github.com/go-gitea/gitea/blob/master/models/pull.go#L285


kind/bug kind/enhancement

All 19 comments

Just changing the clone path to owner/repo.{{.Time}}.git instead would speed it up considerably, as a git clone on the same mount point is just hardlinks; also slapping on --local could speed it up in some cases.

Further improvements include --single-branch or --depth 1 to only fetch the branch you're merging into.

--local, -l
When the repository to clone from is on a local machine, this flag bypasses the normal "Git aware" transport mechanism and clones the repository by making a copy of HEAD and everything under objects and refs directories. The files under .git/objects/ directory are hardlinked to save space when possible.

--no-hardlinks
Force the cloning process from a repository on a local filesystem to copy the files under the .git/objects directory instead of using hardlinks. This may be desirable if you are trying to make a back-up of your repository.
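
A minimal sketch of that suggestion, assuming the clone target sits on the same filesystem as the repository (all paths, the branch name and the timestamp scheme here are placeholders, not Gitea's actual code):

    # Hypothetical fast local clone: hardlinked objects, only the base branch.
    TMP_CLONE="$TMP_BASE/$OWNER/$REPO.$(date +%s).git"
    git clone --local --single-branch --branch "$BASE_BRANCH" \
        "$REPO_ROOT/$OWNER/$REPO.git" "$TMP_CLONE"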

http://stackoverflow.com/questions/10637584/how-to-do-a-non-fast-forward-git-merge-to-a-branch-that-isnt-checked-out might be useful as well.

Edit: I have a patch experimenting with that, will submit it later.

@typeless Not sure if it works for bare repos though.

In my case it is a little slow too; the repo is very big. One solution would be to add a gauge/loading image as feedback to the user. Is that possible?

If yes, it would be good to add it when applying a merge of a pull request too, which is also very slow and has no feedback for the user, only the loading indicator in the browser title.

My repo size is 1.8 GB.

If we could refactor so that there is no local clone operation on the server when merging, I think it would significantly decrease merging time. Or at least keep the clone permanently and just reset --hard from origin when needed; in that case all of its operations must be protected by locking.
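
A minimal sketch of that second idea, assuming a persistent per-repository clone and an external per-repository lock held by the caller (paths and branch name are placeholders):

    # Hypothetical persistent clone reused across merges; callers must hold a per-repo lock.
    cd "$PERSISTENT_CLONE_PATH"
    git fetch origin                                      # refresh objects and remote-tracking refs
    git checkout -B "$BASE_BRANCH" "origin/$BASE_BRANCH"  # (re)create the local base branch
    git reset --hard "origin/$BASE_BRANCH"                # drop leftovers from any failed merge
    git clean -fdx                                        # remove untracked files from previous runs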

I once attempted to fix the issue by combining some plumbing commands but that approach had some critical flaws. It seems that the better solution would be to fix it in the Git upstream.

See also https://githubengineering.com/move-fast/

I don't like that at present we're basically forced to create arbitrary paths on the server - it smells dangerous to me, even though we should have protected against climbing out of the root. I wouldn't be surprised if there were genuine issues on Windows regarding this, e.g. reserved filenames, filename length, reserved characters and the like. I could also imagine a filename encoding problem affecting Linux at some point. These problems similarly affect uploads.

I'm just gonna put some details about how git works down here so they don't disappear (there's a rough combined sketch after the lists below).

  • How does git commit work?

    • Basically it uses the index created using git add and the like

    • exports GIT_AUTHOR_*

    • git write-tree with GIT_INDEX_FILE set appropriately to write the index as a tree object to the git objects db saving the stdout'd $tree_id for the next step

    • git commit-tree $tree_id with the parent ids prepended with -p and commit message through stdin, outputs the $commit_id

    • git update-ref HEAD $commit_id $CURRENT_HEAD to ensure that when we change HEAD it hasn't been changed under our feet in the meantime

    • if successful git gc --auto and deal with post-commit and some info.

    • None of this actually depends on a working tree.

  • How does git add "$FILE" work?

    • The first step is to add the object to the db: git hash-object -w "$FILE" outputting the $BLOB_ID

    • This can be replicated by cat "$FILE_DATA" | git hash-object -w --path="$FILE_PATH" --stdin

    • You then need to add that object to the index, which is achieved by: git update-index --add --info-only "$FILE"

    • A functionally equivalent command is: git update-index --add --cacheinfo $MODE,$BLOB_ID,$FILE_PATH - meaning you don't need the file to actually be at the supposed path.

    • (Before doing anything, you should either git read-tree HEAD, or git ls-tree -r $TREE | git update-index --index-info, to initialise the index from HEAD/$TREE - otherwise the tree you commit will only contain what you've added.)

  • git rm is similar, and there's also --replace in the git update-index to force changes.
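
Putting the two lists together, a rough sketch of committing a single file into a bare repository without any working tree could look like this (the repository path, file names, identity and message are placeholders, not Gitea's code):

    # Hypothetical: operate on a bare repo through a throwaway index file.
    export GIT_DIR="/data/git/repositories/owner/repo.git"
    export GIT_INDEX_FILE="$(mktemp)"
    export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
    export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

    git read-tree HEAD                                  # seed the index from the current HEAD
    BLOB_ID=$(git hash-object -w --path="README.md" --stdin < /tmp/new-readme.md)
    git update-index --add --cacheinfo "100644,$BLOB_ID,README.md"

    TREE_ID=$(git write-tree)                           # write the index as a tree object
    PARENT_ID=$(git rev-parse HEAD)
    COMMIT_ID=$(echo "Update README" | git commit-tree "$TREE_ID" -p "$PARENT_ID")
    git update-ref HEAD "$COMMIT_ID" "$PARENT_ID"       # fails if HEAD moved under our feet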

I need to look at how merge actually works - I think we should be able to generate the merged data directly.

Using plumbing it should be possible to actually not do a physical checkout at all and create an index in a bare repository that shares the object dbs of the repositories with the two HEADS we're merging from.

E.g. we create an index from HEAD_1, then merge each changed file in HEAD_2 into HEAD_1's version in turn - hashing each new object and updating the index as appropriate. We finally write the index as a tree, commit the tree, and update the refs before pushing. The files being merged don't need to be written to a working copy - just to a temporary file. Similarly the shared repositories can be in a temp directory and the index is a temporary file too. At no point do we actually try to check out a repo, and if a user wants to make their own repo difficult by sticking things in .git/ they can (you can't commit any path with a .git in it from the git CLI).
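
A very rough sketch of that approach, assuming a scratch bare repository whose alternates file borrows the base repository's object database (this is not how Gitea currently merges; all paths and branch names are hypothetical, and git-merge-one-file does still write the files it is merging into whatever working directory it is given):

    # Hypothetical scratch repo that shares objects with the base repository.
    SCRATCH="$(mktemp -d)" && git init --bare "$SCRATCH"
    echo "$BASE_REPO/objects" >> "$SCRATCH/objects/info/alternates"

    export GIT_DIR="$SCRATCH" GIT_INDEX_FILE="$(mktemp)"
    export GIT_WORK_TREE="$(mktemp -d)"                 # only merged/conflicting files land here

    # Fetching the two tips is cheap because most objects are already reachable via alternates.
    git fetch "$BASE_REPO" "refs/heads/$BASE_BRANCH:refs/heads/base"
    git fetch "$HEAD_REPO" "refs/heads/$HEAD_BRANCH:refs/heads/head"

    MERGE_BASE=$(git merge-base base head)
    git read-tree -i -m --aggressive "$MERGE_BASE" base head           # trivial 3-way merge in the index
    (cd "$GIT_WORK_TREE" && git merge-index -o git-merge-one-file -a)  # file-level merge for the rest

    TREE_ID=$(git write-tree)
    COMMIT_ID=$(echo "Merge head into base" | git commit-tree "$TREE_ID" \
        -p "$(git rev-parse base)" -p "$(git rev-parse head)")
    git update-ref refs/heads/base "$COMMIT_ID"
    git push "$BASE_REPO" "refs/heads/base:refs/heads/$BASE_BRANCH"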


Finally got round to looking at the source code for git: verify_path, which is run when adding to the index, disallows committing any .git directories (https://github.com/git/git/blob/7f4e64169352e03476b0ea64e7e2973669e491a2/read-cache.c#L951). Although I haven't checked completely, I suspect a malicious repo that contained a tree with a ".git" directory would similarly be protected against by verify_path.

@zeripath Btw, I had a failed attempt following this line of thinking: https://github.com/go-gitea/gitea/pull/641.
Maybe it's actually feasible given sufficient knowledge of the git plumbing and internals.

I think your current approach should reduce the load considerably. Certainly the use of shared object databases is extremely important and would account for much of the improvement in load. It would be worth a PR in itself; in fact, if you don't think #4921 is going to be ready in time for 1.7, I'd recommend doing exactly that. (It's worth noting that most repos have many more objects than checked-out files, and although hard links are relatively cheap they still need to be created and still change inodes.)

Just thinking on, although you've shared the base repository object database, when you add the remote that doesn't get shared too. It is possible to set up a repo to share multiple object databases. Could it be that one of these repos could have a much smaller object db, so when you fetch on the "remote" you actually copy a lot of objects unintentionally? This is likely to be an extreme edge case though.

Although libgit2 would likely provide an improvement, it still requires an in-depth understanding of the plumbing, and most of GitHub's improvements come from that deeper knowledge of the plumbing. Yes, they saw benefits from de-scriptifying in the way that libgit2 allows them to, but we still need to move off the porcelain and have a deeper understanding of the plumbing before we can take that approach. Your current PR is a definite step towards deeper plumbing knowledge. I'm working on file upload at present and have a working - albeit not very pretty - solution for this.

It's also worth noting that the unique thing libgit2 provides is the in-memory index - the merging strategies come from this, but as the default index is just a file, it may be that we don't need this.

Certainly the use of shared object databases is extremely important and would be responsible for improvements in load.

We deployed that bit on our server a couple of months ago and it makes the difference between totally unusable (i.e. the merge times out) and slow, but usable.

The Gitea performance on Windows with our multi-gigabyte repositories is atrocious out of the box, but we run a forked version that makes it quite bearable. It uses go-git for accessing the repositories and often makes the pages load > 10x faster (in some cases > 30x). I was not brave enough to use go-git for any write operations, especially since the merging algorithm is quite complex, but using the shared object database in itself saved ~30 seconds on each merge on 3+ GB repositories.

@filipnavara why not share your changes back with us? I have also started to move parts of the read operations to go-git.

@lafriks I shared the links in other issues/PRs. https://github.com/filipnavara/gitea/commits/commitgraph (also the perf-read branch). I will rebase it when I get back after new year.

Nice, will take a look again. Did the perf improvements get merged into go-git?

Yep, go-git has merged my performance fixes.

The branch with serialized commit tree is stale at the moment, but it is not that important anyway. We use it on our server, but the performance improvements are relatively small compared to the rest of the changes.

Just thinking on, although you've shared the base repository object database, when you add the remote that doesn't get shared too. It is possible to set up a repo to share multiple object databases. Could it be that one of these repos could have a much smaller object db, so when you fetch on the "remote" you actually copy a lot of objects unintentionally? This is likely to be an extreme edge case though.

@zeripath Adding the path of the head repository to .git/objects/info/alternates of the staging repository can probably address the problem of redundant git-fetch.
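
For illustration, the alternates file is just a plain-text list of extra object directories, so this would amount to something like (paths are hypothetical):

    # Let the staging repo resolve objects from the head repository's object database,
    # so fetching the PR branch from it transfers almost nothing.
    echo "/data/git/repositories/contributor/fork.git/objects" \
        >> "$STAGING_REPO/objects/info/alternates"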

There is a bounty on this issue: https://www.bountysource.com/issues/40674560-merging-prs-takes-very-long-time-for-large-repository

It's closed. It's from 2017. 😄

it can be claimed ... that's what I'd like to remind people of
