We are currently migrating all of our 230 repositories to Gitea. Here's how we do it:
To keep it simple, our Gitea instance has just one organisation with around 110 "Owner" team members. When we migrate repositories with more than 50 tags to that organisation, the git client pushes everything without an error, but Gitea fails while creating the releases for the tags, with the following message:
2019/10/16 11:12:42 models/webhook.go:909:DeliverHooks() [E] Get repository [235] hook tasks: dial tcp 172.xxx.xxx.xxx:3306: connect: cannot assign requested address
In the Gitea logfile I can see that Gitea calls repo_permission.go for every tag and every user of that organisation, which results in a lot of calls and presumably database queries:
2019/10/16 11:12:39 models/update.go:80:PushUpdate() [T] TriggerTask 'TAGNAME' by gituser
2019/10/16 11:12:39 models/pull.go:1088:AddTestPullRequestTask() [T] AddTestPullRequestTask [head_repo_id: xxx, head_branch: xxx]: finding pull requests
2019/10/16 11:12:39 modules/auth/auth.go:97:CheckOAuthAccessToken() [T] ParseOAuth2Token: signature is invalid
2019/10/16 11:12:39 ...s/context/context.go:328:func1() [D] Session ID: xxx
2019/10/16 11:12:39 ...s/context/context.go:329:func1() [D] CSRF Token: xxx
2019/10/16 11:12:39 models/pull.go:1128:AddTestPullRequestTask() [T] AddTestPullRequestTask [base_repo_id: xxx, base_branch: xxx]: finding pull requests
2019/10/16 11:12:40 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824725727744:USER1 in 824725727824:ORG/reponame
2019/10/16 11:12:40 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824729407712:USER2 in 824729407808:ORG/reponame
2019/10/16 11:12:40 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824729900800:USER3 in 824729900880:ORG/reponame
[...]
When migrating that same repository to a "standalone" user account, everything works as expected and all releases get created. In the Gitea logfile I can see that Gitea doesn't call repo_permission.go in this case:
2019/10/16 12:18:27 models/update.go:80:PushUpdate() [T] TriggerTask 'TAG1' by gituser
2019/10/16 12:18:27 models/pull.go:1088:AddTestPullRequestTask() [T] AddTestPullRequestTask [head_repo_id: xxx, head_branch: xxx]: finding pull requests
2019/10/16 12:18:27 modules/auth/auth.go:97:CheckOAuthAccessToken() [T] ParseOAuth2Token: signature is invalid
2019/10/16 12:18:27 ...s/context/context.go:328:func1() [D] Session ID: xxx
2019/10/16 12:18:27 ...s/context/context.go:329:func1() [D] CSRF Token: xxx
2019/10/16 12:18:27 models/pull.go:1128:AddTestPullRequestTask() [T] AddTestPullRequestTask [base_repo_id: xxx, base_branch: xxx]: finding pull requests
2019/10/16 12:18:28 models/webhook.go:898:DeliverHooks() [T] DeliverHooks [repo_id: xxx]
2019/10/16 12:18:28 models/update.go:80:PushUpdate() [T] TriggerTask 'TAG2' by gituser
[...]
We also tried to migrate a very large repository with around 15,000 commits and 7,000 tags. Although we get an error from the git client, everything (tags and commits) gets transmitted and all releases get created by Gitea when migrating to a user repository.
user@linux:~/repo.git$ git push --mirror https://gitea.example.org/ORG/repo.git
Counting objects: 409616, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (108168/108168), done.
Writing objects: 100% (409616/409616), 1.14 GiB | 25.52 MiB/s, done.
Total 409616 (delta 201291), reused 409611 (delta 201286)
error: RPC failed; curl 56 GnuTLS recv error (-110): The TLS connection was non-properly terminated.
fatal: The remote end hung up unexpectedly
error: error in sideband demultiplexer
Everything up-to-date
user@linux:~/repo.git$ git push --mirror https://gitea.example.org/ORG/repo.git
Everything up-to-date
Long story short: when we migrate repositories with more (or way more) than 50 tags to a user account, everything works. When we migrate them to an organisation with a lot of members, Gitea gets stuck while creating releases from the tags.
As a workaround we first migrate to a user account and then transfer ownership of the repository to the organisation.
This seems to be related to #8528 (#8273).
I'd recommend changing app.ini and setting:
[database]
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 300
(I'd experiment with the values and see if they improve the results)
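For reference, here is a minimal sketch (mine, not Gitea's actual startup code; the DSN is a placeholder) of what those two ini keys correspond to in Go's database/sql:

```go
package main

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL/MariaDB driver
)

func openGiteaDB() (*sql.DB, error) {
	// Placeholder DSN; Gitea builds its own from the [database] section.
	db, err := sql.Open("mysql", "gitea:password@tcp(mariadb-host.example.org:3306)/gitea")
	if err != nil {
		return nil, err
	}
	db.SetMaxIdleConns(5)                  // MAX_IDLE_CONNS: idle connections kept around for reuse
	db.SetConnMaxLifetime(5 * time.Minute) // CONN_MAX_LIFE_TIME: recycle connections after this long
	return db, nil
}
```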
CONN_MAX_LIFE_TIME = 5m is easier to read.
Thank you both for your quick response. I changed the database settings and tried migrating one of our biggest repositories (>6,000 tags, >14,000 commits) to our Gitea organisation. After creating 854 releases and around 100,000 repo_permission.go calls I see the same error, though:
2019/10/17 08:40:54 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824811255664:SOMEUSER in 824811255760:ORG/repo.git: ...
[...]
2019/10/17 09:06:58 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824662319392:SOMEUSER in 824662319520:ORG/repo.git: ...
2019/10/17 09:06:58 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824662792016:SOMEUSER in 824662792224:ORG/repo.git: ...
2019/10/17 09:06:58 ...ters/private/hook.go:165:HookPostReceive() [E] Failed to Update: ORG/repo.git Branch: BRANCHNAME Error: PushUpdate: CommitRepoAction: NotifyWatchers: insert new action: dial tcp 172.xxx.xxx.xxx:3306: connect: cannot assign requested address
I think this is an efficiency issue, as I don't see why there have to be so many repo_permission.go calls when creating releases from tags. When I migrate the same repository to a user account and then move it to the organisation, repo_permission.go only runs once, for the user transferring the repository.
@mansshardt Thanks for your reports. Could you share your database settings?
Sure! Here they are:
[database]
DB_TYPE = mysql
HOST = mariadb-host.example.org:3306
NAME = gitea
USER = gitea
PASSWD = super-secret-password
SSL_MODE = disable
LOG_SQL = False
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 300
300 means 300 nanoseconds, NOT 300 seconds; that's too short. Could you change that to 5m and try again.
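A quick illustration of the unit trap (plain Go, not Gitea's ini parsing): a bare integer treated as a time.Duration counts nanoseconds, while a string with a unit parses as you'd expect:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	fmt.Println(time.Duration(300)) // prints "300ns" - a bare number is nanoseconds

	d, err := time.ParseDuration("5m")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // prints "5m0s"
}
```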
Sorry, my fault about the nanoseconds.
My fault too. I should have checked.
No problem! I changed the setting to CONN_MAX_LIFE_TIME = 5m and restarted Gitea. Sadly, this didn't make a difference. After creating 766 releases I still get:
2019/10/17 15:14:30 ...ters/private/hook.go:165:HookPostReceive() [E] Failed to Update: ORG/repo.git Branch: BRANCHNAME Error: PushUpdate: CommitRepoAction: NotifyWatchers: insert new action: dial tcp 172.xxx.xxx.xxx:3306: connect: cannot assign requested address
The 172.xxx.xxx.xxx:3306: connect: cannot assign requested address part is most likely caused by too many connection attempts to MariaDB, but I thought this would be solved by the new settings.
Since TCP can only create and destroy fewer than ~64,515 socket pairs in a short span of time (because of TIME_WAIT), and this import operation seems to be depleting that range, perhaps you can try using UNIX sockets for your connection to MariaDB instead?
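For anyone whose MariaDB does live on the same host, this is roughly what the suggestion means at the driver level; a sketch using the go-sql-driver/mysql DSN syntax, with an assumed socket path:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// TCP (what the failing log line shows): "gitea:password@tcp(172.xxx.xxx.xxx:3306)/gitea"
	// UNIX socket: no TCP handshake, so no TIME_WAIT entries pile up.
	// The socket path below is an assumption; adjust it to your system.
	db, err := sql.Open("mysql", "gitea:password@unix(/var/run/mysqld/mysqld.sock)/gitea")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.Ping(); err != nil { // actually opens a connection
		log.Fatal(err)
	}
}
```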
I just pushed (--mirror) again and checked the connections to the database via netstat -an | grep ':3306' | wc -l. During the creation of releases I have around 25,000 TCP sockets in TIME_WAIT. As mentioned before, I'd guess that every permission call results in one database connection which doesn't get reused.
I don't understand why so many connections are necessary in the first place. In my opinion, fiddling with MariaDB properties or database settings to mitigate the issue shouldn't be the solution. I also can't use UNIX sockets, as the database is not on the same machine.
@mansshardt I agree that this should not be the solution. I was only offering alternatives. 😄
For some reason MAX_IDLE_CONNS didn't have any effect. AFAIU it should have had the effect of reusing the connections for most transactions, especially after CONN_MAX_LIFE_TIME = 5m.
You probably need my PR, which allows you to set MAX_OPEN_CONNS to prevent too many open connections to the DB.
@zeripath If I'm not mistaken, the problem in this issue is the number of _closed_ connections, not the open ones.
I agree that this should not be the solution. I was only offering alternatives. 😄 For some reason MAX_IDLE_CONNS didn't have any effect. AFAIU it should have had the effect of reusing the connections for most transactions, especially after CONN_MAX_LIFE_TIME = 5m.
@guillep2k Sure, and thanks again for your input.
The issue is not that there are too many open connections to the database, as @guillep2k mentioned. The problem is that during the creation of releases from tags there is a huge number of connections which get established and quickly closed again. This doesn't seem very efficient. As the system holds a TCP socket in TIME_WAIT for some time, we end up with a lot of these (~25,000) during the creation of releases from tags.
It seems that MAX_IDLE_CONNS and CONN_MAX_LIFE_TIME = 5m don't have any effect. I would expect these settings to build some kind of connection pool, which they don't.
I think two elements are relevant in this issue:
- I've checked the docs and the sources and I couldn't find a reason why the connections are not pooled.
- https://www.alexedwards.net/blog/configuring-sqldb
@zeripath It looks like your PR #8528 could be related after all, but the default value should be working nonetheless. In theory, by not calling SetMaxOpenConns() we're effectively using MaxOpenConns == 0, which should allow for any value in SetMaxIdleConns(). But that somehow is not working.
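To make the default concrete, here is my own illustration of Go's database/sql behaviour (not Gitea code, placeholder DSN): with MaxOpenConns left at 0 (unlimited), a burst of concurrent queries opens as many connections as it needs, and everything beyond MaxIdleConns is closed as soon as it is returned to the pool, which is exactly the open/close churn seen here:

```go
package main

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// configurePool is illustrative only; dsn is a placeholder.
func configurePool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}

	// No db.SetMaxOpenConns(...) call: MaxOpenConns stays at 0 (unlimited),
	// so a burst of N concurrent queries may open up to N connections.

	// Only 5 of those connections are kept idle for reuse; every extra one
	// is closed as soon as it is returned, leaving a TIME_WAIT socket behind.
	db.SetMaxIdleConns(5)

	// Even idle connections are recycled after 5 minutes.
	db.SetConnMaxLifetime(5 * time.Minute)

	return db, nil
}
```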
@guillep2k Yeah, so a long enough lifetime, a large enough MaxOpenConns, and a MaxIdleConns near MaxOpenConns should prevent the rapid opening and closing of connections, avoiding the port-number depletion and at least papering over this implementation fault.
We need to think about these permissions calls a bit better and consider if we can cache these results in some way.
In this particular case, if we look at cmd/hook.go:
https://github.com/go-gitea/gitea/blob/280f4bebbf5883eef94d8a84b21575393ae5e0e1/cmd/hook.go#L124-L186
Within the context of the hook we read each line of the provided stdin. We get one line per updated ref and they are of the form:
<old-value> SP <new-value> SP <ref-name> LF
This then gets translated into a separate GET request to the Gitea server for each ref.
This has the benefit that each commit SHA is logged for free, but if you're updating a lot of refs it means a lot of separate HTTP requests.
Pre-receive has a similar architecture.
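To make that concrete, here is a simplified sketch (not Gitea's actual hook code) of the stdin loop: one line per updated ref and, in the current design, one HTTP round trip per line:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		// Each line is: <old-value> SP <new-value> SP <ref-name>
		fields := strings.SplitN(scanner.Text(), " ", 3)
		if len(fields) != 3 {
			continue
		}
		oldSHA, newSHA, refName := fields[0], fields[1], fields[2]
		// In the current design each of these becomes its own HTTP request
		// to the Gitea server; with thousands of tags that is thousands of
		// requests (and database round trips).
		fmt.Printf("would notify server about: %s %s %s\n", oldSHA, newSHA, refName)
	}
}
```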
Now, that architecture means that even if we were doing this within a single session we wouldn't get much benefit from session caching - although it might have some benefit.
A better architecture would be to pass all of the refs in a POST; we could then create a repofiles.PushUpdates which could have all the updated refs.
Unfortunately, when I made these changes to the hooks I considered but dismissed the idea that anyone would be likely to send almost a thousand updates in one push, so in terms of doing the least work I only made the simplest implementation.
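Just to sketch the idea (the struct, field names, and URL below are hypothetical, not an existing Gitea API): the hook would collect every ref line from stdin and send them to the server in a single POST, so permissions could be loaded once per push instead of once per ref:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// refUpdate is a hypothetical payload item, one per line read from stdin.
type refUpdate struct {
	OldCommitID string `json:"old_commit_id"`
	NewCommitID string `json:"new_commit_id"`
	RefFullName string `json:"ref_full_name"`
}

// postAllUpdates sends every updated ref in one request instead of one
// request per ref. The URL is a placeholder for an internal endpoint.
func postAllUpdates(updates []refUpdate) error {
	body, err := json.Marshal(map[string][]refUpdate{"updates": updates})
	if err != nil {
		return err
	}
	resp, err := http.Post("http://localhost:3000/api/internal/hook/post-receive",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// Example: a single tag update; a --mirror push would collect thousands.
	_ = postAllUpdates([]refUpdate{
		{OldCommitID: "0000000000000000000000000000000000000000",
			NewCommitID: "abc123", RefFullName: "refs/tags/v1.0.0"},
	})
}
```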
A better architecture would be to pass all of the refs in a POST; we could then create a repofiles.PushUpdates which could have all the updated refs.
Although optimization is always a good thing, I think the root of the problem here is the connection pool. It can bite us back at any time, not only when migrating large repositories.
Yeah, at the end of the day it doesn't matter how efficient this bit of code is: if you haven't configured the pool properly, the right kind of load can run it out of connections.
At least with #8528 we will expose all the configurables that Go provides to the user; if that's still not enough, then we'll have to think about writing our own pool (one which could at least handle this error and wait).
If MaxOpenConns and MaxIdleConns are equal, then there should be at most MSL * MaxOpenConns / MaxLifetime connections in TIME_WAIT. If you change MaxIdleConns to be different from MaxOpenConns, you're likely to need to increase MaxLifetime, but there will be a point at which there is no stable solution.
Without setting MaxOpenConns a sufficient load _will_ cause port exhaustion.
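To put rough numbers on that (my own back-of-the-envelope, assuming a typical 60-second TIME_WAIT interval): with MaxOpenConns = MaxIdleConns = 5 and MaxLifetime = 5m, at most 5 connections are recycled every 300 seconds, so you would expect around 60 s × 5 / 300 s = 1 socket in TIME_WAIT at any moment, versus the ~25,000 observed with an unbounded pool.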
@zeripath Mmm... I was about to write a long explanation of how MaxOpenConns should not affect the number of closed connections, but now I think I see your point. The only way to keep the system from creating TIME_WAIT entries is to avoid closing connections as much as possible, so MaxIdleConns should be equal to MaxOpenConns in this type of application, where many users can be doing operations at the same time.
Again, your PR seems on point. What I wonder is: what's the strategy in the database driver for scheduling the connection requests when they are all busy? FIFO? Are there others available?
Without looking at the code, I would guess it's actually "random": the most obvious implementation is a simple spin lock with a wait until you actually get the real lock. I would bet there is no formal queue (too expensive), so we're into OS-level queuing algorithms, suggesting a likely bias towards LIFO.
@mansshardt I know it's a bit of a hassle, but is it possible for you to grab @zeripath's PR, build from source, and try the following?
[database]
; ...
MAX_OPEN_CONNS = 5
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 5m
(It's important for the test that the first two values match)
@guillep2k I will try that on Monday when I am back at the office and get back to you.
I just had the chance to test with a build from @zeripath's PR and the following db settings:
[database]
; ...
MAX_OPEN_CONNS = 5
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 5m
With this build and settings I can see proper database pooling. During migration I have two or three TCP sockets in ESTABLISHED state, which get used randomly. After five minutes the sockets go into TIME_WAIT and then get removed, as expected. With this build the migration of huge repos with a lot of tags works well, even though it's pretty slow. But all releases get created properly. I think #8602 should improve the speed issue. Hope to see both fixes/improvements in an official release soon.
@mansshardt would you be able to try #8602? It would be a great test of the code if it worked.
@zeripath With my build from #8602, git doesn't time out anymore, but only 190 of over 6,000 tags got created as releases. Gitea didn't throw any error.
Damn that means that I have a bug...
OK @mansshardt, I think that 200 is too large a batch size for Gitea to process without the internal request timing out, and that's why you only get ~200 processed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 weeks. Thank you for your contributions.