Jobs ran correctly on May 3rd.
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [ReassignOldTasksJob] [f4f9c340-549c-48ce-b95e-49d3774b4b7b] [2017-05-03 00:01:15 -0400] Performed ReassignOldTasksJob from Sidekiq(default) in 6417.79ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [CreateEstablishClaimTasksJob] [ceb62eaa-dc01-4bf6-a49b-f000e74b4652] [2017-05-03 00:06:08 -0400] Performed CreateEstablishClaimTasksJob from Sidekiq(default) in 298659.39ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0e8db86f1a9470a87-172.30.82.225 [ActiveJob] [PrepareEstablishClaimTasksJob] [1852e12f-9072-4799-8d7e-939ffc7b7eba] [2017-05-03 00:33:09 -0400] Performed PrepareEstablishClaimTasksJob from Sidekiq(default) in 1828568.29ms
Only 2 jobs ran on May 4th.
Looks like CreateEstablishClaimTasksJob didn't run.
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0a2b83ca8cb366983-172.30.87.70 [ActiveJob] [ReassignOldTasksJob] [3b848b10-fc64-47a9-a313-971ec71d63bb] [2017-05-04 00:01:10 -0400] Performed ReassignOldTasksJob from Sidekiq(default) in 2802.13ms
dsva-appeals-certification_worker-prod/opt/caseflow-certification/src/log/caseflow-certification-worker.out i-0a2b83ca8cb366983-172.30.87.70 [ActiveJob] [PrepareEstablishClaimTasksJob] [b0db56cd-8f7e-45ac-a4eb-f59b11f1b37a] [2017-05-04 00:36:36 -0400] Performed PrepareEstablishClaimTasksJob from Sidekiq(default) in 2066693.7ms
I started the job manually. @joofsh thinks the missing :active_job setting might be the key to the issue. We will reassess after today's deploy.
Jobs ran correctly from May 7-10th.
Haven't seen any errors since we made the active_job change. I wonder if it addressed the issue.
Closing this issue since we haven't observed any anomalies since.
Reopening this issue.
PrepareEstablishClaimTasksJob failed on the morning of 06/30/2017.
Some food for thought:
The reason PrepareEstablishClaimTasksJob is so "unstable" is that we're updating all appeals in a single job. This is not a good use of Sidekiq, because the whole job stops if any one error occurs. There are two ways to solve this: 1) add lots of error-handling logic to the job itself, or 2) use Sidekiq's built-in retry logic and break this task into smaller tasks. In general with Sidekiq, we want to make each task the smallest possible unit. So this "parent" job could instead trigger a smaller job for each appeal to query VBMS for its documents. That way, with 100 smaller jobs, if one or two run into VBMS connection issues, they'll naturally get re-queued by Sidekiq and the other 98 appeals will still be ready to use in Dispatch. CC @aroltsch @askldjd
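To make the fan-out idea concrete, here is a rough plain-Ruby sketch of the pattern. It does not use Sidekiq itself: `InMemoryQueue`, `MAX_ATTEMPTS`, and `enqueue_prepare_jobs` are hypothetical names, and the in-memory queue only stands in for Sidekiq's retry behavior. The simulated VBMS failure is made up for illustration.

```ruby
# Plain-Ruby model of "one small job per appeal". InMemoryQueue is a
# hypothetical stand-in for Sidekiq, not its real API.
class InMemoryQueue
  MAX_ATTEMPTS = 3 # hypothetical retry limit

  attr_reader :succeeded, :dead

  def initialize
    @pending = []
    @succeeded = []
    @dead = []
  end

  def push(appeal_id, attempt = 1)
    @pending << [appeal_id, attempt]
  end

  # Run until the queue is empty. A failure only re-queues that one appeal;
  # every other appeal keeps going.
  def drain
    until @pending.empty?
      appeal_id, attempt = @pending.shift
      begin
        yield appeal_id, attempt
        @succeeded << appeal_id
      rescue StandardError
        if attempt < MAX_ATTEMPTS
          push(appeal_id, attempt + 1)
        else
          @dead << appeal_id
        end
      end
    end
  end
end

# The "parent" job: it does no real work, it only enqueues one job per appeal.
def enqueue_prepare_jobs(queue, appeal_ids)
  appeal_ids.each { |id| queue.push(id) }
end

queue = InMemoryQueue.new
enqueue_prepare_jobs(queue, (1..100).to_a)

# Simulated run: appeals 7 and 42 hit a connection error on their first attempt.
flaky = [7, 42]
queue.drain do |appeal_id, attempt|
  raise IOError, "VBMS connection failed" if flaky.include?(appeal_id) && attempt == 1
  # The per-appeal document fetch would go here.
end
```

In this simulation the two flaky appeals are retried on their own and the other 98 finish without waiting for them, which is the behavior we would be relying on Sidekiq to provide.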
Reasons why a job might fail:
@joofsh and I had a discussion and came up with the following.
PrepareEstablishClaimTasksJob.devops-alerts channel “PrepareEstablishClaimTasksJob successfully ran. 10 tasks prepared. 3 tasks failed” OLD AC:
We have no clue when Sidekiq fails and why until everything is on fire.
Rails.logger is a terrible way of finding out at 4 AM (after browsing through CloudWatch logs) that a Sidekiq job failed.
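As a rough sketch of the proposed alert, the summary message could be built like this. `job_summary` and `webhook_payload` are hypothetical helpers, and the JSON body with a `text` field is an assumption about what the channel's webhook accepts; nothing here actually posts anywhere.

```ruby
require "json"

# Hypothetical helper: builds the one-line summary quoted in the proposal.
def job_summary(job_name, prepared:, failed:)
  "#{job_name} successfully ran. #{prepared} tasks prepared. #{failed} tasks failed"
end

# Hypothetical helper: wraps the message in a JSON body.
# Assumption: the webhook expects a "text" field.
def webhook_payload(message)
  JSON.generate({ text: message })
end

message = job_summary("PrepareEstablishClaimTasksJob", prepared: 10, failed: 3)
payload = webhook_payload(message)
```

The counts would come from the job itself at the end of a run, so the alert shows up in the channel instead of only in the logs.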
The timer is there:

but no webhook in dev_null
PASSED
WOOOOOOOOOO
