Caseflow: Sidekiq | We need to loudly log/announce/report failures and successes.

Created on 2 May 2017 · 11 Comments · Source: department-of-veterans-affairs/caseflow

This ticket will focus on PrepareEstablishClaimTasksJob.
If "preparation" of a task fails, we will catch the VBMS error and continue with the next task.
We will keep track of the tasks that failed and write them to the log file.
We will add Slack integration that posts to the devops-alerts channel: “PrepareEstablishClaimTasksJob successfully ran. 10 tasks prepared. 3 tasks failed”
PrepareEstablishClaimTasksJob will run at 5pm.
CreateEstablishClaimTasksJob will run at 4:30pm.
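
The catch-and-continue behaviour described above could look roughly like the sketch below. It is not the actual Caseflow code: `VBMSError`, `PrepareResult`, `prepare_all`, and the `prepare!` and `id` methods on a task are all hypothetical names chosen for illustration, and the logger is assumed to be anything that responds to `error` (for example `Rails.logger`).

```ruby
# Hypothetical stand-in for whatever error the VBMS client raises.
class VBMSError < StandardError; end

# Hypothetical result object: ids of tasks that were prepared and that failed.
PrepareResult = Struct.new(:prepared, :failed) do
  # Builds the one-line summary in the format quoted in the ticket.
  def summary(job_name)
    "#{job_name} successfully ran. #{prepared.size} tasks prepared. " \
      "#{failed.size} tasks failed"
  end
end

# Prepare each task in turn. A VBMS failure is recorded and logged, and the
# loop moves on to the next task instead of aborting the whole job.
def prepare_all(tasks, logger:)
  result = PrepareResult.new([], [])
  tasks.each do |task|
    begin
      task.prepare! # assumed to raise VBMSError when VBMS misbehaves
      result.prepared << task.id
    rescue VBMSError => e
      result.failed << task.id
      logger.error("Failed to prepare task #{task.id}: #{e.message}")
    end
  end
  result
end
```

Only the VBMS error is rescued here, so an unexpected bug still stops the job and shows up as a failure rather than being silently counted.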

High caseflow-dispatch Tango 💃 Bug

askldjd

All 11 comments

Jobs ran correctly on May 3rd.

dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [ReassignOldTasksJob] [f4f9c340-549c-48ce-b95e-49d3774b4b7b] [2017-05-03 00:01:15 -0400] Performed ReassignOldTasksJob from Sidekiq(default) in 6417.79ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [CreateEstablishClaimTasksJob] [ceb62eaa-dc01-4bf6-a49b-f000e74b4652] [2017-05-03 00:06:08 -0400] Performed CreateEstablishClaimTasksJob from Sidekiq(default) in 298659.39ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [PrepareEstablishClaimTasksJob] [1852e12f-9072-4799-8d7e-939ffc7b7eba] [2017-05-03 00:33:09 -0400] Performed PrepareEstablishClaimTasksJob from Sidekiq(default) in 1828568.29ms

askldjd on 3 May 2017

Only 2 jobs ran on May 4th.

Looks like CreateEstablishClaimTasksJob didn't run.

dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0a2b83ca8cb366983-172.30.87.70 [ActiveJob] [ReassignOldTasksJob] [3b848b10-fc64-47a9-a313-971ec71d63bb] [2017-05-04 00:01:10 -0400] Performed ReassignOldTasksJob from Sidekiq(default) in 2802.13ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0a2b83ca8cb366983-172.30.87.70 [ActiveJob] [PrepareEstablishClaimTasksJob] [b0db56cd-8f7e-45ac-a4eb-f59b11f1b37a] [2017-05-04 00:36:36 -0400] Performed PrepareEstablishClaimTasksJob from Sidekiq(default) in 2066693.7ms

I started the job manually. @joofsh thinks the missing :active_job setting might be the key to the issue. We will reassess after today's deploy.

askldjd on 4 May 2017

Jobs ran correctly from May 7-10th.

Haven't seen any errors since we made the active_job change. I wonder if it addressed the issue.

askldjd on 10 May 2017

Closing this issue since we haven't observed any anomalies since.

askldjd on 28 Jun 2017

Reopening this issue.

PrepareEstablishClaimTasksJob failed on the morning of 06/30/2017

joofsh on 30 Jun 2017

Some food for thought:

+1 Alan's idea of moving the job to 4pm.
+1 Adding Slack integration directly to Caseflow. Having that infrastructure in general will be good for many things going forward.
The reason PrepareEstablishClaimTasksJob is so "unstable" is that we're updating all appeals in a single job. This is not a good use of Sidekiq, as the job will stop if any one error occurs. There are two ways to solve this: 1) add lots of error-handling logic to the job itself, or 2) use Sidekiq's built-in retry logic and break this task into smaller tasks. In general with Sidekiq, we want to make each task the smallest possible unit. So this "parent" job could instead trigger a smaller job for each appeal to query VBMS for its documents. That way, with 100 smaller jobs, if one or two run into VBMS connection issues, they'll naturally get re-queued by Sidekiq and the other 98 appeals will still be ready to use in Dispatch.

CC @aroltsch @askldjd

joofsh on 17 Jul 2017
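
The parent/child split suggested in this comment can be illustrated with a toy model. To keep the sketch runnable without any gems, `ToyQueue` below is a made-up in-memory stand-in for Sidekiq that re-queues failed jobs a fixed number of times; in a real app the child would be a Sidekiq worker or ActiveJob class and the retries would come from Sidekiq itself. `PrepareOneAppealJob`, `PrepareAllAppealsJob`, and the `fetcher` hook are likewise hypothetical names.

```ruby
# Toy in-memory queue standing in for Sidekiq: a failed job is re-queued up
# to max_retries times, then parked in `dead`.
class ToyQueue
  attr_reader :dead

  def initialize(max_retries: 3)
    @jobs = []
    @dead = []
    @max_retries = max_retries
  end

  def enqueue(job_class, *args)
    @jobs << [job_class, args, 0]
  end

  # Run until the queue is empty; one failing job never blocks the others.
  def drain
    until @jobs.empty?
      job_class, args, attempts = @jobs.shift
      begin
        job_class.new.perform(*args)
      rescue StandardError
        if attempts < @max_retries
          @jobs << [job_class, args, attempts + 1]
        else
          @dead << args
        end
      end
    end
  end
end

# Hypothetical child job: the smallest unit of work, a single appeal.
class PrepareOneAppealJob
  class << self
    # Callable that fetches documents for one appeal (assumed to hit VBMS).
    attr_accessor :fetcher
  end

  def perform(appeal_id)
    self.class.fetcher.call(appeal_id)
  end
end

# Hypothetical parent job: does no VBMS work itself, only fans out.
class PrepareAllAppealsJob
  def perform(queue, appeal_ids)
    appeal_ids.each { |id| queue.enqueue(PrepareOneAppealJob, id) }
  end
end
```

With this shape, a transient VBMS failure on one appeal costs only a retry of that appeal, while the remaining appeals are prepared as usual.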

Reasons why a job might fail:

VBMS doesn't respond
BGS responds with error
Vacols is dead

aroltsch on 18 Jul 2017

@joofsh and I had a discussion and came up with the following.

This ticket will focus on PrepareEstablishClaimTasksJob.
If "preparation" of a task fails, we will catch the VBMS error and continue with the next task.
We will keep track of the tasks that failed and write them to the log file.
We will add Slack integration that posts to the devops-alerts channel: “PrepareEstablishClaimTasksJob successfully ran. 10 tasks prepared. 3 tasks failed”
The job will run at 4pm.

aroltsch on 18 Jul 2017

👍2
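
One way the Slack announcement could be wired up is through a Slack incoming webhook, which accepts a JSON body with a `text` field. This is only a sketch under that assumption; the thread does not say which Slack mechanism was chosen. `build_slack_request`, `post_to_slack`, and the `SLACK_WEBHOOK_URL` environment variable are hypothetical names, and the target channel (devops-alerts in the ticket) would be configured on the webhook itself.

```ruby
require "json"
require "net/http"
require "uri"

# Build the HTTP request separately so it can be inspected without
# touching the network.
def build_slack_request(webhook_url, message)
  uri = URI(webhook_url)
  request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  request.body = JSON.generate(text: message)
  [uri, request]
end

# Send the message; returns true when the server answers with a 2xx status.
def post_to_slack(webhook_url, message)
  uri, request = build_slack_request(webhook_url, message)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess)
end

# Example call (not executed here; the env var name is an assumption):
#   post_to_slack(ENV.fetch("SLACK_WEBHOOK_URL"),
#                 "PrepareEstablishClaimTasksJob successfully ran. " \
#                 "10 tasks prepared. 3 tasks failed")
```

Keeping the request construction separate from the network call makes the message format easy to check in a unit test.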

OLD AC:

The Dispatch Problem:

We have no clue when Sidekiq fails, or why, until everything is on fire.
Rails.logger is a terrible way of finding out at 4AM (after browsing through CloudWatch logs) that a Sidekiq job failed.

Solutions:

Sidekiq announces to Slack whether the job ran successfully or failed, and what the result was.
Sidekiq announces to the UI the real results of each job.
Dispatch jobs moved to 4PM. According to @cmgiven, that is a fine time to run the jobs, and someone can be around to watch them.
Sidekiq needs to keep going if part of the job fails, not stop dead in its tracks.
If we are still going to kill the whole job, then we need to be able to retry intelligently.
If Sidekiq succeeds, we should still be notified, because monitoring is good. Can we make it visible through the Sidekiq UI? Even just for basic stats... Log it somewhere special, not CloudWatch. @lakohl needs to be able to access that easily.