Harbor: Impossible to delete replication rule : "have pending/running/retrying status"

Created on 3 Mar 2019 · 20 Comments · Source: goharbor/harbor

Hi!

Here is what I did:

1) Create an endpoint
2) Create replication rules using this endpoint
3) Decommission this endpoint (uninstall Harbor and remove its DNS record)
4) Try to remove the associated replication rules so that the endpoint can be removed

The problem is that I cannot remove the rules associated with the dead endpoint. Every time I try, it says "Rule X have pending/running/retrying status". So I tried to stop all jobs for the rule with the "stop job" button in the web UI, but this is what shows up in jobservice.log:

Mar 3 16:13:38 172.18.0.1 jobservice[2212]: 2019-03-03T09:13:38Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/ab3f3f6327dce749c5449ec1' error: 404 {"code":10013,"message":"object is not found","details":"job 'ab3f3f6327dce749c5449ec1'"}
Mar 3 16:13:38 172.18.0.1 jobservice[2212]: 2019-03-03T09:13:38Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/dc20e1e80c3ee9750dd80253' error: 404 {"code":10013,"message":"object is not found","details":"job 'dc20e1e80c3ee9750dd80253'"}
Mar 3 16:13:38 172.18.0.1 jobservice[2212]: 2019-03-03T09:13:38Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/1410b7421705eb20c0ec1610' error: 404 {"code":10013,"message":"object is not found","details":"job '1410b7421705eb20c0ec1610'"}
Mar 3 16:13:38 172.18.0.1 jobservice[2212]: 2019-03-03T09:13:38Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/e2d49baa86145448d17e741e' error: 404 {"code":10013,"message":"object is not found","details":"job 'e2d49baa86145448d17e741e'"}
Mar 3 16:13:38 172.18.0.1 jobservice[2212]: 2019-03-03T09:13:38Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/1fd1304e3cccc3c4d4be2a80' error: 404 {"code":10013,"message":"object is not found","details":"job '1fd1304e3cccc3c4d4be2a80'"}
Mar 3 16:13:38 172.18.0.1 jobservice[2212]: 2019-03-03T09:13:38Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/5a0f25c7137fe09c0eb0cc78' error: 404 {"code":10013,"message":"object is not found","details":"job '5a0f25c7137fe09c0eb0cc78'"}
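
To collect the affected job IDs from a larger jobservice.log, something like the following might help. It is only a minimal sketch: the pattern is inferred from the log lines quoted above, not from Harbor's source code, and `missing_job_ids` is a hypothetical helper name.

```python
import re

# Pattern inferred from the sample log lines above (an assumption, not taken
# from Harbor's code): a POST to /api/v1/jobs/<id> answered with a 404.
JOB_404 = re.compile(
    r"Serve http request 'POST /api/v1/jobs/(?P<job_id>[0-9a-f]+)' error: 404"
)


def missing_job_ids(log_lines):
    """Return the job IDs that jobservice reported as not found.

    IDs are returned in first-seen order, without duplicates.
    """
    seen = []
    for line in log_lines:
        match = JOB_404.search(line)
        if match and match.group("job_id") not in seen:
            seen.append(match.group("job_id"))
    return seen
```

Calling `missing_job_ids(open("jobservice.log"))` would then give the list of IDs the UI tried to stop but jobservice no longer knows about.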

I get the same result when trying through the various API calls.

I also tried changing the endpoint of these rules, or setting them to manual, hoping that would let me remove them the next day, but no success.

It seems the jobservice is not aware of these dead, hanging jobs, but they are present in the database (I checked).

Since I removed a lot of Harbor endpoints that were associated with many replication rules, I end up with dead endpoints/replication rules and I am not able to clean them up.

If you need any other details, logs, etc., let me know!

I found a similar issue here: https://github.com/goharbor/harbor/issues/2897 but it says "disabling the rule is the "official" way to stop a job" and I do not really know how to "disable" a rule...

Labels: area/replication, target/1.8.0

guillaumelfv

All 20 comments

@guillaumelfv

We'll take a look at this issue and see if we can find a better way to handle the case you ran into.

steven-zou on 4 Mar 2019

@guillaumelfv

An ugly workaround to fix the issue: directly set the status of all the "zombie" tasks to "error" in the database, then retry the deletion.

steven-zou on 4 Mar 2019

How safe is it to do that on a live Harbor instance? I was actually thinking about dropping all tasks, then all rules/endpoints, in the database and redeploying clean rules. But that is an ugly workaround, and I am not sure how Harbor is going to behave once I start touching the database directly.

guillaumelfv on 4 Mar 2019

An ugly workaround to fix the issue: directly set the status of all the "zombie" tasks to "error" in the database, then retry the deletion.

@steven-zou

I did this and marked all jobs as "finished" directly in the PostgreSQL database. It did not work, and when I look at the "replication jobs" of the rules in the UI I still see dozens of pending jobs (but there are none in the database...)

guillaumelfv on 5 Mar 2019

@steven-zou I have also hit this problem and need to fix it soon... Please point me to the right way to do it.

louyiping on 5 Mar 2019

I have hit this case many times and resolved it by restarting all the Harbor instances. But this time restarting did not help.

louyiping on 5 Mar 2019

I think Harbor should improve replication first, rather than less significant features.
As far as I know, many users have had problems with replication, including push replication and delete replication.

louyiping on 5 Mar 2019

👍3

Could it be that the stale pending/retrying jobs are in Redis? I run Harbor in HA, so we have a remote Redis server which I have never restarted. We have upgraded Harbor through several versions with this Redis server.

guillaumelfv on 7 Mar 2019

Update: I tried moving from my own Redis server back to the Redis container.

It did not change anything.

It seems that every time I restart the Harbor instances, all the jobs for a replication policy get stuck, fail, and then stay stale forever as pending, running or retrying. The only solution so far is to create a new rule, which works until the next restart; then it goes stale again until I create another one...

Can anyone investigate?

I run Harbor in HA with our own PostgreSQL and Redis instances.

guillaumelfv on 12 Mar 2019

@guillaumelfv

Will fix it in 1.8

steven-zou on 11 Apr 2019

I had the same issue a long time ago (1.2.0)...

gunboe on 12 Apr 2019

Work done! Close issue.

PR: https://github.com/goharbor/harbor/pull/7452

Fix is delivered in 1.8

steven-zou on 29 Apr 2019

👍1

I'm using Harbor 1.8 and still get replications stuck on InProgress.
I can't delete the replication: "the policy 19 has running executions, can not be deleted"
I can't stop the executions either.

YakirShriker on 25 Jul 2019

The issue still exists in 1.8.1. I had to find the IDs of the executions shown as "InProgress" in the Harbor UI and update them directly in the DB (the statements below mark them as succeeded). After this, I was able to delete the replication.

Workaround:

psql -U postgres

\c registry

\d replication_execution;

select * from replication_execution where id in (1695,1694,1693,1692,1691,1690,1689,1688,1733,1732);

update replication_execution set status = 'Succeed',total = '1', end_time = now() where id in (1695,1694,1693,1692,1691,1690,1689,1688,1733,1732);

After this, you should be able to delete the replication rule from the UI.
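
For a long list of IDs, the two statements above can be generated rather than typed by hand. This is only a sketch under assumptions: `build_unstick_queries` is a hypothetical helper, the `%s` placeholders assume a DB-API driver that uses that style (psycopg2, for example), and the table and column names are the ones from the workaround above.

```python
def build_unstick_queries(execution_ids):
    """Return (select_sql, update_sql, params) for the given execution IDs.

    The SELECT is meant to be run first, to review the rows before the
    UPDATE marks them as finished.
    """
    ids = [int(i) for i in execution_ids]  # raises on non-numeric input
    if not ids:
        raise ValueError("no execution IDs given")

    # One placeholder per ID; the driver fills them in (assumed %s style).
    placeholders = ", ".join(["%s"] * len(ids))
    select_sql = (
        f"SELECT * FROM replication_execution WHERE id IN ({placeholders});"
    )
    update_sql = (
        "UPDATE replication_execution "
        "SET status = 'Succeed', total = '1', end_time = now() "
        f"WHERE id IN ({placeholders});"
    )
    return select_sql, update_sql, tuple(ids)
```

With psycopg2 the result would be passed as `cur.execute(update_sql, params)`. Taking a database backup beforehand seems wise, since this bypasses Harbor entirely.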

Hope this helps.

deshab on 2 Aug 2019

👍5

@deshab, @YakirShriker

Let's track the issue you mentioned in this open one: https://github.com/goharbor/harbor/issues/8202

steven-zou on 7 Aug 2019

I hit this issue too.
Aug 21 13:59:40 172.18.0.1 jobservice[21084]: 2019-08-21T05:59:40Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/3d278055d4a5a0c9eaf5a004' error: 404 {"code":10013,"message":"object is not found","details":"job '3d278055d4a5a0c9eaf5a004'"}
Aug 21 13:59:40 172.18.0.1 jobservice[21084]: 2019-08-21T05:59:40Z [ERROR] [handler.go:253]: Serve http request 'POST /api/v1/jobs/3caf7b30ecdb2b8a5cdd8f3a' error: 404 {"code":10013,"message":"object is not found","details":"job '3caf7b30ecdb2b8a5cdd8f3a'"}

I cannot stop the task in Replications.
Can anyone help me?
My Harbor is 1.7.0

HelenaZheng on 21 Aug 2019

@HelenaZheng If you replicate from Harbor A to Harbor B, just restart the jobservice container in Harbor A.

louyiping on 22 Aug 2019

You mean restart the following container?

715ec1b8d125 goharbor/harbor-jobservice:v1.7.0

HelenaZheng on 22 Aug 2019

@HelenaZheng yes

YakirShriker on 22 Aug 2019

Cleaning the job statuses in the Postgres replication_execution table was not enough. Deleting the data in the replication_execution table was not enough. Restarting the whole of Harbor was not enough. I had to stop the harbor-jobservice container, open redis-cli, flush database 2, and start harbor-jobservice again. Watch jobservice.log during the process to check whether it is sending data to the Redis container.
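
The last step (stop jobservice, flush Redis database 2, start jobservice) could be scripted roughly as below. This is a sketch under assumptions: the container names `harbor-jobservice` and `redis` and the presence of `redis-cli` inside the Redis container depend on the deployment, and `jobservice_redis_reset_commands` is a hypothetical helper that only builds the commands so they can be reviewed before anything runs.

```python
def jobservice_redis_reset_commands(
    jobservice="harbor-jobservice",  # assumed container name
    redis="redis",                   # assumed container name
    db=2,                            # database number mentioned above
):
    """Build the docker commands for the stop / flush / start sequence."""
    return [
        ["docker", "stop", jobservice],
        # FLUSHDB removes every key in the selected database only.
        ["docker", "exec", redis, "redis-cli", "-n", str(db), "flushdb"],
        ["docker", "start", jobservice],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` once the names have been checked against `docker ps`. Note that flushing the database drops all queued jobs stored there, not just the stuck ones.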