Caseflow: Sidekiq | We need to loudly log/announce/report failures and successes.

Created on 2 May 2017  ·  11 Comments  ·  Source: department-of-veterans-affairs/caseflow

  1. This ticket will focus on PrepareEstablishClaimTasksJob.
  2. If "preparation" of a task fails, we will catch the VBMS error and continue with the next task.
  3. We will keep track of the tasks that failed and log them to the log file.
  4. We will have Slack integration to log to the devops-alerts channel “PrepareEstablishClaimTasksJob successfully ran. 10 tasks prepared. 3 tasks failed”
  5. The PrepareEstablishClaimTasksJob will run at 5pm
  6. CreateEstablishClaimTasksJob will run at 4:30pm
Labels: High, caseflow-dispatch, Tango 💃, Bug

All 11 comments

Jobs ran correctly on May 3rd.

dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [ReassignOldTasksJob] [f4f9c340-549c-48ce-b95e-49d3774b4b7b] [2017-05-03 00:01:15 -0400] Performed ReassignOldTasksJob from Sidekiq(default) in 6417.79ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [CreateEstablishClaimTasksJob] [ceb62eaa-dc01-4bf6-a49b-f000e74b4652] [2017-05-03 00:06:08 -0400] Performed CreateEstablishClaimTasksJob from Sidekiq(default) in 298659.39ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [PrepareEstablishClaimTasksJob] [1852e12f-9072-4799-8d7e-939ffc7b7eba] [2017-05-03 00:33:09 -0400] Performed PrepareEstablishClaimTasksJob from Sidekiq(default) in 1828568.29ms

Only 2 jobs ran on May 4th.

Looks like CreateEstablishClaimTasksJob didn't run.

dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0a2b83ca8cb366983-172.30.87.70 [ActiveJob] [ReassignOldTasksJob] [3b848b10-fc64-47a9-a313-971ec71d63bb] [2017-05-04 00:01:10 -0400] Performed ReassignOldTasksJob from Sidekiq(default) in 2802.13ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0a2b83ca8cb366983-172.30.87.70 [ActiveJob] [PrepareEstablishClaimTasksJob] [b0db56cd-8f7e-45ac-a4eb-f59b11f1b37a] [2017-05-04 00:36:36 -0400] Performed PrepareEstablishClaimTasksJob from Sidekiq(default) in 2066693.7ms

I started the job manually. @joofsh thinks the missing :active_job setting might be the key to the issue. We will reassess after today's deploy.

Jobs ran correctly from May 7-10th.

Haven't seen any errors since we made the active_job change. I wonder if it addressed the issue.

Closing this issue since we haven't observed any anomalies since.

Reopening this issue.

PrepareEstablishClaimTasksJob failed on the morning of 06/30/2017.

Some food for thought:

  • +1 Alan's idea of moving the job to 4pm.
  • +1 Adding Slack integration directly to Caseflow. Having that infrastructure in general will be good for many things going forward.
  • The reason the PrepareEstablishClaimTasksJob is so "unstable" is that we're updating all appeals in a single job. This is not a good use of Sidekiq, because the whole job stops if any one error occurs. There are two ways to solve this: 1) add lots of error-handling logic to the job itself, or 2) use Sidekiq's built-in retry logic and break this task into smaller tasks. In general with Sidekiq, we want each task to be the smallest possible unit of work. So this "parent" job could instead trigger a smaller job for each appeal that queries VBMS for its documents. That way, with 100 smaller jobs, if 1 or 2 run into VBMS connection issues, they'll naturally get re-queued by Sidekiq and the other 98 appeals will still be ready to use in Dispatch.
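To make the "smallest possible unit" idea concrete, here is a rough plain-Ruby simulation of one unit of work per appeal, each retried on its own. It does not call the Sidekiq API at all: the in-memory queue, `run_per_appeal`, `VBMSConnectionError`, and `max_attempts` are all made-up names for illustration. In a real setup each unit would presumably be its own Sidekiq job and Sidekiq would do the re-queueing.

```ruby
# Plain-Ruby simulation of "one small job per appeal, retried independently".
# Nothing here uses Sidekiq; the queue below is only a stand-in for it.

# Hypothetical error class standing in for a VBMS connection failure.
class VBMSConnectionError < StandardError; end

# Runs one unit of work per appeal id. A failed unit goes back on the queue
# (up to max_attempts) without blocking the other appeals.
def run_per_appeal(appeal_ids, max_attempts: 3)
  queue = appeal_ids.map { |id| [id, 1] } # pairs of [appeal_id, attempt_number]
  ready = []
  dead = []

  until queue.empty?
    appeal_id, attempt = queue.shift
    begin
      yield appeal_id, attempt
      ready << appeal_id
    rescue VBMSConnectionError
      if attempt < max_attempts
        queue.push([appeal_id, attempt + 1]) # retry later, after the others
      else
        dead << appeal_id # out of attempts; needs a human to look at it
      end
    end
  end

  { ready: ready, dead: dead }
end

# Appeal 2 fails on its first attempt only; appeal 4 fails every time.
result = run_per_appeal([1, 2, 3, 4, 5]) do |appeal_id, attempt|
  raise VBMSConnectionError if appeal_id == 2 && attempt == 1
  raise VBMSConnectionError if appeal_id == 4
end
p result
```

In this run appeals 1, 3, and 5 are ready on the first pass, appeal 2 becomes ready on its retry, and only appeal 4 ends up in `dead`. One flaky appeal no longer takes the rest down with it.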

CC @aroltsch @askldjd

Reasons why a job might fail:

  1. VBMS doesn't respond
  2. BGS responds with error
  3. VACOLS is down

@joofsh and I had a discussion and came up with the following.

  1. This ticket will focus on PrepareEstablishClaimTasksJob.
  2. If "preparation" of a task fails, we will catch the VBMS error and continue with the next task.
  3. We will keep track of the tasks that failed and log them to the log file.
  4. We will have Slack integration to log to the devops-alerts channel “PrepareEstablishClaimTasksJob successfully ran. 10 tasks prepared. 3 tasks failed”
  5. The job will run at 4pm
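Steps 2 through 4 could look roughly like the sketch below. This is not the actual job code: `VBMSError`, `prepare_all`, and `summary_message` are hypothetical names, and the block passed to `prepare_all` stands in for whatever the real "preparation" does. Only the wording of the summary line comes from the plan above.

```ruby
require "logger"

# Hypothetical stand-in for whatever error the VBMS client raises.
class VBMSError < StandardError; end

# Prepares each task, recording failures instead of aborting the whole run.
# `tasks` is any enumerable; the block does the real preparation work.
def prepare_all(tasks, logger: Logger.new($stdout))
  prepared = []
  failed = []

  tasks.each do |task|
    begin
      yield task
      prepared << task
    rescue VBMSError => e
      # Log the failure and move on to the next task.
      logger.error("Failed to prepare task #{task}: #{e.message}")
      failed << task
    end
  end

  { prepared: prepared, failed: failed }
end

# Builds the summary line intended for the devops-alerts channel.
def summary_message(result)
  "PrepareEstablishClaimTasksJob successfully ran. " \
    "#{result[:prepared].size} tasks prepared. #{result[:failed].size} tasks failed"
end

# Task 3 simulates a VBMS failure; the other tasks still get prepared.
result = prepare_all([1, 2, 3, 4]) do |task_id|
  raise VBMSError, "VBMS did not respond" if task_id == 3
end
puts summary_message(result)
```

Only the VBMS error is rescued here, so any other exception would still stop the run. Whether BGS or VACOLS errors should be caught the same way is left open.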

OLD AC:

The Dispatch Problem:

We have no clue when Sidekiq fails, or why, until everything is on fire.
Rails.logger is a terrible way of finding out at 4AM (after browsing through CloudWatch logs) that a Sidekiq job failed.

Solutions:

  • Sidekiq announces to Slack that the job ran successfully or failed and what was the result
  • Sidekiq announces to the UI the real results of each job
  • Dispatch jobs moved to 4PM - according to @cmgiven that is a fine time to run the jobs and someone can be around to watch them
  • Sidekiq needs to keep on going if part of the job failed - not stop dead in its tracks
  • If we are still going to kill the whole job, then we need to be able to retry intelligently.
  • If Sidekiq succeeds, we should still be notified, because monitoring is good. Can we make it visible through the Sidekiq UI? Even just for basic stats... Log it somewhere special, not CloudWatch. @lakohl needs to be able to access that easily.
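For the Slack announcement, one possible shape is a small helper that builds the message and posts it to an incoming webhook. This is only a sketch under assumptions: `slack_payload` and `notify_slack` are hypothetical names, the message wording is illustrative, and the webhook URL would have to come from configuration. It relies on Slack incoming webhooks accepting a JSON body with a `text` field.

```ruby
require "json"
require "net/http"
require "uri"

# Builds the JSON body for a Slack incoming webhook ({"text": "..."}).
# The wording of the message is illustrative, not taken from Caseflow.
def slack_payload(job_name, prepared:, failed:)
  status = failed.zero? ? "succeeded" : "finished with failures"
  JSON.generate(text: "#{job_name} #{status}. #{prepared} tasks prepared. #{failed} tasks failed.")
end

# Posts the payload to Slack. webhook_url is hypothetical and must be
# supplied by the caller (for example from an environment variable).
def notify_slack(webhook_url, payload)
  Net::HTTP.post(URI(webhook_url), payload, "Content-Type" => "application/json")
end

# Print the payload instead of sending it, so this runs without a webhook.
puts slack_payload("PrepareEstablishClaimTasksJob", prepared: 10, failed: 3)
```

Sending on success as well as on failure means a missing message is itself a signal that the job never ran.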

The timer is there:
[Screenshot, 2017-07-26 10:50:55 AM]
but no webhook in dev_null

PASSED

WOOOOOOOOOO
[Screenshot, 2017-07-26 12:38:33 PM]
