Gitea: Prune hook_task Table

Created on 16 Mar 2020  路  19Comments  路  Source: go-gitea/gitea

  • Gitea version (or commit ref): NA
  • Git version: NA
  • Operating system: NA
  • Database (use [x]): NA

    • [ ] PostgreSQL

    • [x] MySQL

    • [ ] MSSQL

    • [ ] SQLite

  • Can you reproduce the bug at https://try.gitea.io:

    • [ ] Yes (provide example URL)

    • [ ] No

    • [x] Not relevant

  • Log gist:

Description

The hook_task table doesn't appear to be pruned at any point, so on an active site this table can grow to be very large and it makes loading the edit webhook page quite slow. I'm not sure how the action table relates but that may need to be pruned also?

Possible Solution

One solution I thought of is to allow administrators to set a max number of deliveries to retain per webhook.

Screenshots

MySQL [(none)]> SELECT 
    ->      table_schema as `Database`, 
    ->      table_name AS `Table`, 
    ->      round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` 
    -> FROM information_schema.TABLES 
    -> ORDER BY (data_length + index_length) DESC;
+--------------------+------------------------------------------------------+------------+
| Database           | Table                                                | Size in MB |
+--------------------+------------------------------------------------------+------------+
| gogs               | hook_task                                            |    1723.52 |
| gogs               | action                                               |     908.06 |
kinenhancement revieweconfirmed

All 19 comments

Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy

At first, we can add button on repository webhook management UI to delete old webhooks.

@lunny If I understand your suggestion, I don't think that solves the problem. That's only helpful if you have a handful of repositories, in which case you would never have this problem in the first place. Imagine having to click that button on 25,000 repositories!

The only realistic solution in that case is to have a global setting that the admin can control.

@jag3773 I also think there should be a button on admin panel but it's not conflicted with my idea. The button on admin panel will delete all the repositories in the gitea instance and for a public instance, we should let user chose whether to delete them.

Just posting a note I am looking into this issue, thanks!

This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.

A configuration item could be added to keep recent 1 year(or longer) webhooks. A background go routine could clean them every day.

@lunny thanks, I had submitted a PR for an implementation based on deleting all but x webhooks (can be set per repo). Do you have any thoughts there...that was the implementation that worked best for our use case but I can understand if you'd rather have something a little different for general usage.

Then we could have two choices. One is to delete old finished records, another is to keep some recent unfinished records.

Sounds like you are both saying similar things. My request is to limit the number of items kept in the history, whether it is date based or count based is not so important.

The advantage of "count based" is that every repo will retain a certain number of recent events. In a date based system, it's likely you would not see any history for a repo that hasn't been used recently.

I guess @lunny you are saying we should give the user the option to either purge by "older than x days" or an option like we had suggested "delete all but most recent x entries"? Which should be the default option?

@bhalbright Yes. That's what I meant.

Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy

@lunny @lafriks If I see this correctly, the action table could also be pruned. In my opinion the only function of the action table is to provide data for the dashboard. The timespan for the dashboard is currently 1 year, so data older than one year could be pruned. Our action table is about 1 GB in size after 1 year usage of gitea, which makes up 90% of the total database size.

So it would be perfect if someone could please implement the following features:

  • The timespan displayed of the dashboard should be configurable, so not always a full year has to be displayed (dashboard_display_period).
  • The timespan for historization of the action table should be configurable (action_historization_period (default is dashboard_display_period)).
  • Pruning of the action table should be implemented by a gitea cron task in dependence of the configured value for action_historization_period.

I'm sorry I can't support coding, but I don't speak go.

I think we could also have a button on admin panel to clean the two tables.

Yes, would be a first step.

So partially this is a problem of database management. When you get into large enough systems those running Gitea are going to need to actually do some DB management themselves and not rely on the ORM creating a perfect DB.

For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems.

Throwing away action data is a decision for server managers - I'm not sure that we at Gitea should be running anything that deletes data by default.

However, we can and should do a few things here.

  • I agree putting in some configurabilty about dashboard length/feed length is required.
  • We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default

Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work.

One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense.

@zeripath regarding the cron to delete old actions being OFF by default, would you expect the same for a cron job to delete from hook_task? In the PR I had submitted it was turned on by default globally and then you could turn on/off by repo in the UI.

@zeripath

For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems.

You are right, that would be simple and effective, but I would not recommend this approach to Gitea admins, because then you have another DDL as the one delivered with Gitea.

However, we can and should do a few things here.

  • I agree putting in some configurabilty about dashboard length/feed length is required.
  • We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default

That's great and I appreciate that.

Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work.

IMHO I would advice you not to use plain text files and create a hybrid db of your own. Plain text files are a mess and in a container environment they are a mess and pain. I think the preasure on this topic isn't so high, so making the things configurable paired with a few simple db jobs inside Gitea as you recommended would be sufficient.

One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense.

Which data is updated? As far as I know, the data in both the action table and the hook_task table are immutable and are not updated. But as I mentioned before I would not go with an hybrid approach, this introduces unneeded complexity.

Thank you for your efforts.

@bhalbright in regards to the hook_task table - in some ways that could be argued as being just user visible logging. Actions though are the unique behaviours of the users - it's a bit more than logging.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

internalfx picture internalfx  路  3Comments

kifirkin picture kifirkin  路  3Comments

Fastidious picture Fastidious  路  3Comments

BRMateus2 picture BRMateus2  路  3Comments

BNolet picture BNolet  路  3Comments