Caseflow: Enable back-end sorting of tasks by docket type and number

Created on 8 Jul 2019  路  9Comments  路  Source: department-of-veterans-affairs/caseflow

11247 established the pattern for how the pagination API will sort tasks for the queue on the back-end but did not enable sorting by all fields. This ticket exists to enable sorting by docket type and number through the pagination API that mirrors how we sort by that field on the front-end.

Acceptance criteria

  • [ ] Enable back-end sorting of tasks by docket type and number
generic-queue queue-table-view caseflow-queue Echo 馃惉

Most helpful comment

We're going to proceed with option 1 here. We will create a table devoted specifically to caching fields that can be sorted/filtered in the queue table view. We will write a job that populates this table on a regular basis (at least daily, maybe as often as hourly) from two requests, one each to VACOLS and Caseflow.

Next step on this ticket will be to write the portion of that job that will fetch docket number from VACOLS (docket type will always be "legacy") that associate with LegacyAppeal records. This should ideally be a single statement but if it's a series of fast-running statements, then that is fine as well. We will then run that statement live to see how performant it is and how it can be improved (do we need to ask Jed to add an index to some table?).

  • [x] Write statement to pull docket numbers for all LegacyAppeals from VACOLS.
  • [ ] Benchmark it.

After that we will write the same statement for the postgres Caseflow database, write the job, add the table, start populating that table, then join to this dedicated table and sort by these values in the pagination API.

All 9 comments

Quick sketch at how we can at least sort by docket type in a single statement (looker link):

select tasks.*
, case
    when tasks.appeal_type = 'LegacyAppeal' then 'legacy'
    else appeals.docket_type
  end as sorting_col
from tasks
left join appeals
  on appeals.id = tasks.appeal_id
  and tasks.appeal_type = 'Appeal'
order by sorting_col;

Including docket number in the sort requires a request to VACOLS so getting that information for LegacyAppeals may be a little more involved.

Problem

On the front-end we can sort by docket type and number because those values are fetched during the task (really appeal) serialization step and are available on the front-end. The pagination API aims to be as efficient as possible and therefore (probably) cannot afford to load ~5,000 objects into memory before sorting.

In order to sort by docket number in addition to docket type (sorting by docket type is already possible with the existing database structure) we need to compare values across database systems (folder.tinum in Oracle VACOLS and appeals.docket_number in Postgres Caseflow). Below are some possible ways we might accomplish that.

Solutions

  1. Copy data from Oracle into Postgres and update it when we access the record in VACOLS if it has fallen out of date. Docket number shouldn't change very often and incorrectly sorting a single task when it does is not a terrible failure mode. This has the advantage of only executing a single statement for single database system.

  2. Pull up-to-date docket numbers from one database into temporary join table in the second. Has the advantage of always being up-to-date with VACOLS but makes requests to both databases and holds a fairly sizable chunk of data (at least 2 values for ~850,000 rows) in memory for the duration of the second statement's execution.

  3. We could be a little clever. Since we are sorting by docket type then docket number and we know that the docket types will always sort: direct_review, evidence_submission, hearing, legacy we can union two sorted sets (the first three dockets will be in Caseflow, legacy will always be in VACOLS) and only make requests to both databases when a page of tasks will cross the union. This does not entirely solve the problem since we will need to sort the rows in VACOLS as well, but it does separate the sorting into two discrete problems.

  4. Do the sorting at the application layer. We could load all of the tasks from both databases and sort by task.appeal.docket_number. It is unclear to me how much slower this will be than the approaches that move the sorting to the database layer. We can avoid the serialization step which takes a lot of time, but we still hit both databases, store a lot of data in memory, and loop over a large set.

Are there other options we should consider? Are there strong reasons for or against any of the solutions listed above?

We're going to proceed with option 1 here. We will create a table devoted specifically to caching fields that can be sorted/filtered in the queue table view. We will write a job that populates this table on a regular basis (at least daily, maybe as often as hourly) from two requests, one each to VACOLS and Caseflow.

Next step on this ticket will be to write the portion of that job that will fetch docket number from VACOLS (docket type will always be "legacy") that associate with LegacyAppeal records. This should ideally be a single statement but if it's a series of fast-running statements, then that is fine as well. We will then run that statement live to see how performant it is and how it can be improved (do we need to ask Jed to add an index to some table?).

  • [x] Write statement to pull docket numbers for all LegacyAppeals from VACOLS.
  • [ ] Benchmark it.

After that we will write the same statement for the postgres Caseflow database, write the job, add the table, start populating that table, then join to this dedicated table and sort by these values in the pagination API.

Paired this afternoon to verify that this approach would work and made some progress. Collecting info from Caseflow DB should work just, still digging into how long it takes for VACOLS request to finish:

# VACOLS::Case (vacols.brieff)
# - Primary Key: bfkey (same as legacy_appeals.vacols_id)

# VACOLS::Folder (vacols.folder)
# - Primary Key: ticknum (same as legacy_appeals.vacols_id and brieff.bfkey)
# - Docket Number: tinum

# Is this an expensive query to run?
# Let's check it out! Get only 5 records:
rails c> VACOLS::Folder.limit(5).pluck(:ticknum, :tinum)
# 5219.3ms for initial overhead to enumerate VACOLS database structure for
# creating ActiveRecord objects.
# Actual query ran in 106.1ms, 83.1ms, 83.1ms, 83.3ms, 83.7ms over 5 trials

# Let's up the ante buds! 50 records!
rails c> VACOLS::Folder.limit(50).pluck(:ticknum, :tinum)
# 84.1ms, 82.3ms, 82.9ms, 82.1ms, 84.0ms

# 500 this time
rails c> VACOLS::Folder.limit(500).pluck(:ticknum, :tinum)
# 407.5ms, 407.7ms, 408.6ms, 408.8ms, 410.1ms

# 5,000:
rails c> VACOLS::Folder.limit(5000).pluck(:ticknum, :tinum)
# 4058.0ms. That's slow.

# 50,000:
rails c> VACOLS::Folder.limit(50000).pluck(:ticknum, :tinum)
# 42130.3ms. Very slow. Appears to be scaling linearly which is odd.

# Let's take the training wheels off!
# Maybe limits are slow in Oracle databases?
rails c> VACOLS::Folder.pluck(:ticknum, :tinum)
# Killed after 106177.7ms (106 seconds)... Interesting. Is this really slowness on
# the DB side of things? Are sequential table scans really that slow? Or is Rails
# trying to do some jui-jitsu that slows this down? Can we ask Jed?
# TODO: Come back to this.

# Let's explore the Caseflow/Postgres side of this equation.
# Docket number is a combination of receipt_date and ID, so let's pull those fields.
# Later we can explore if we want to offload this to the database.
rails c> Appeal.limit(5).pluck(:id, :receipt_date)
# 1.5ms

# Let's go big!
rails c> Appeal.limit(500).pluck(:id, :receipt_date)
# 1.6ms

# BIGGER!
rails c> Appeal.limit(50000).pluck(:id, :receipt_date)
# 10.9ms

# All 'em!
rails c> Appeal.pluck(:id, :receipt_date)
# 7.4ms

The session times out before VACOLS::Folder.pluck(:ticknum, :tinum) ever quits running.

Even just doing a count is super slow:

(1932.5ms)  SELECT COUNT(*) FROM "FOLDER"
=> 3458835

Interesting @ferristseng! I did not realize this table had 3.5 million rows! This is significantly more than the ~850,000 LegacyAppeals we have in Caseflow.

I just played around with a slightly different approach where we only pull a specific set of rows from VACOLS in order to reduce the load and it ran much faster!

rails c> legacy_appeal_ids = Task.open.where(appeal_type: LegacyAppeal.name).pluck(:appeal_id).uniq
rails c> legacy_appeal_ids.count
# 112986
rails c> VACOLS::Folder.where(ticknum: legacy_appeal_ids.first(1000)).pluck(:ticknum, :tinum)
# (480.2ms)

We loaded 1,000 rows in less than half a second! At that pace we can pull all ~100k rows we want in less than a minute! Maybe something like the following:

legacy_appeal_vacols_ids = LegacyAppeal.where(
    id: Task.open.where(appeal_type: LegacyAppeal.name).pluck(:appeal_id).uniq
  ).pluck(:vacols_id)
  docket_nums_for_appeals = legacy_appeal_vacols_ids.in_groups_of(1000, false).map do |ids|
    VACOLS::Folder.where(ticknum: ids).pluck(:ticknum, :tinum)
  end.flatten(1)

Did some benchmarking and the job to populate this table should take ~5 minutes to run.

  rails c> appeals_to_cache = Task.open.where(appeal_type: Appeal.name).map(&:appeal).map do |appeal|
      {
        appeal_id: appeal.id,
        docket_type: appeal.docket_type,
        docket_number: appeal.docket_number,
        appeal_type: Appeal.name
      }
    end; 0
- TODO: Why are we selecting each appeal individually?
  * Does it matter?
  * Started at 09:25:45 -0400
  * Finished at ...
  * Killed this because we were printing SQL statements to the console forever (kiled it after watching it run for ~10 minutes and output ~15 seconds of statements)
- If this actually becomes a bottleneck, we can move it into a direct SQL statement using something like this: https://caseflow-looker.va.gov/sql/swqh58cbyfkj2n

- Let's turn output off on this and see how long it takes to run
  rails c> ActiveRecord::Base.logger.level = 1
  rails c> Time.zone.now
  # Tue, 23 Jul 2019 13:37:18 UTC +00:00
  rails c> appeals_to_cache = Task.open.where(appeal_type: Appeal.name).map(&:appeal).map do |appeal|
      {
        appeal_id: appeal.id,
        docket_type: appeal.docket_type,
        docket_number: appeal.docket_number,
        appeal_type: Appeal.name
      }
    end; 0
  # NoMethodError (undefined method `id' for nil:NilClass)
  rails c> Time.zone.now
  # Tue, 23 Jul 2019 13:38:34 UTC +00:00
  # Where did this appeals go? https://dsva.slack.com/archives/C2ZAMLK88/p1563889466021200

- Let's cancel the existing RootTasks that are messing up this query and run it again.
  rails c> task_ids = [165028, 140914, 146144, 169782, 32702, 148861, 125264, 166369, 145579, 122859, 162107, 170673, 146176]
  rails c> RootTask.find(task_ids).pluck(:instructions)
  # [[], [], [], [], [], [], [], [], [], [], [], [], []]
  rails c> RootTask.where(id: task_ids).update_all(status: Constants.TASK_STATUSES.cancelled, instructions: ["Cancelling RootTask because associated Appeal no longer exists in database - Lowell Wood, 23 July 2019"])

- Let's try again.
  rails c> Time.zone.now
  # Tue, 23 Jul 2019 13:53:47 UTC +00:00
  rails c> appeals_to_cache = Task.open.where(appeal_type: Appeal.name).map(&:appeal).map do |appeal|
      {
        appeal_id: appeal.id,
        docket_type: appeal.docket_type,
        docket_number: appeal.docket_number,
        appeal_type: Appeal.name
      }
    end; 0
  rails c> Time.zone.now
  # Tue, 23 Jul 2019 13:55:12 UTC +00:00
  # I was slow on the draw here, so this took a MAX of ~90 seconds. Probably around
  # a minute.
  # Fast enough.
  # Let's spot check a few of these and see how they look
  rails c> pp appeals_to_cache.last(8)
  # Luckily stumbled upon a portion of the results that contains entries for the same
  # appeal (because there are two active tasks for the appeal). Let's fix that with uniq()
  # and add a test (TODO).

- Now let's see how quickly the VACOLS pull runs.
  rails c> ActiveRecord::Base.logger.level = 1
  rails c> Time.zone.now
  # Tue, 23 Jul 2019 14:04:40 UTC +00:00
  rails c> legacy_appeals = LegacyAppeal.find(Task.open.where(appeal_type: LegacyAppeal.name).pluck(:appeal_id).uniq); 0
  # Tue, 23 Jul 2019 14:04:48 UTC +00:00
  # That step is speedy. Love to see that. 8 seconds. Fantastic.
  # Next step. Let's see how long it takes this map() to run.
  # Data from postgres here. Should be around the same amount of time as the fetch for
  # AMA appeals information (1 minute).
  rails c> Time.zone.now
  rails c> values_to_cache = legacy_appeals.map do |appeal|
      {
        appeal_id: appeal.id,
        appeal_type: LegacyAppeal.name,
        vacols_id: appeal.vacols_id,
        docket_type: appeal.docket_name # "legacy"
      }
    end; 0
  # Instantaneous. Did it actually run?
  rails c> pp values_to_cache.first(3)
  # It worked! Great!

  # Now the call into VACOLS. Expect this to take a WHILE.
  # Let's turn logging back on for this to see what it does.
  rails c> ActiveRecord::Base.logger.level = 0
  # Let's set our BATCH SIZE.
  rails c> BATCH_SIZE = 1000
  rails c> Time.zone.now
  # Tue, 23 Jul 2019 14:10:09 UTC +00:00
  rails c> values_to_cache = legacy_appeals.pluck(:vacols_id).in_groups_of(BATCH_SIZE).map do |vacols_ids|
      VACOLS::Folder.where(ticknum: vacols_ids).pluck(:ticknum, :tinum).map do |vacols_folder|
        { vacols_id: vacols_folder[0], docket_number: vacols_folder[1] }
      end
    end
  # Each sub-select of 1k rows is taking roughly 1 second
  # Should take roughly 3 minutes for ~120k legacy appeals.
  # Finished at 2019-07-23 10:13:36 -0400
  # Started at  2019-07-23 10:10:38 -0400
  # Right on 3 minutes! Fine!

This is done!

Was this page helpful?
0 / 5 - 0 ratings