Caseflow: [Tech Spec] ETL infrastructure

Created on 4 Nov 2019  路  16Comments  路  Source: department-of-veterans-affairs/caseflow

Problem

Currently, the Caseflow PostgreSQL db is mirrored once a day, overnight, to a Redshift instance. The Redshift instance is used by Tableau to provide metrics/reporting to BVA. Zero transformation is attempted on the schema, so it is a Extract/Load process (as opposed to Extract/Transform/Load (ETL)).

In its highly normalized structure in Redshift, the schema is slow to query. It is optimized for data integrity, not performance. As a result, Tableau reports are slow to run.

Caseflow requires a true ETL process, which means injecting a layer of transformation logic into the data export flow.

Some data points:

Assumptions

  • Normalized raw data snapshots will be stored in the BVA Data Lake. The Data Warehouse model is for Tableau reporting.
  • A denormalized schema for transformed data would focus initially on a single table with one row per Appeal.
  • The growth rate of new AMA data is modest, and RDS could support a denormalized schema for several years.
  • There are two related but separate issues to consider:

    • how to transform the data?

    • where to load the transformed data?

  • ETL should benefit from as much automatic testing as possible.

Recommendation

How to transform the data?

Given the amount of business logic already implemented in the Caseflow Rails application models and the desire to keep the ETL logic in sync with schema changes to Caseflow, let's implement another data store directly in Rails. This would be similar to the architectural pattern where full-text search is paired with the normalized db.

Create a new directory in app/models/etl with an abstract base class that knows how to connect to the new data store, and one model per denormalized table.

Where to load the transformed data?

Since Redshift and RDS (PostgreSQL) can have identical schemas, this proposal recommends starting with RDS because it is (a) so much cheaper and (b) should perform just as well as Redshift for several years, till we accumulate much more data.

Pros

  • Leveraging existing infrastructure, business logic, engineering resources.
  • Defers the issue of where to store the transformed data (Redshift and RDS/PostgreSQL both addressable by Rails).

Cons

  • Caseflow monolith gets more monolithic.
Data Delta 馃敽

Most helpful comment

@pshahVA I have updated the summary to make this RDS recommendation more explicit.

All 16 comments

What's the advantage of using a different data store over something within the existing data store (like materialized views and a primary/replica model)?

What's the advantage of using a different data store over something within the existing data store

Offhand, performance and flexibility. We already have a replica, but it's a replica of the normalized data. What we need is denormalized data, similar to the output of https://github.com/department-of-veterans-affairs/caseflow/blob/master/app/sql/ama-cases.sql . The goal is to "flatten" the data.

We could experiment with something like views in PostgreSQL, it just becomes a question of where you put the code that manages the business logic. E.g. the status of an appeal. If we implement that code in SQL (as in the example above) as part of a view, then that would need to manage changes and tests for that logic separately than the rest of our model. Right now Rails treats PostgreSQL as a kind of "dumb" key/value store. Very few native constraints (FKs, stored procedures, etc). So the bias is toward managing everything in Ruby/Rails.

From a flexibility pov, keeping it as a separate store means it's possible to swap in Redshift (or some other MPP db) for PostgreSQL with just a few lines of config change, which is probable once the data gets to a certain size years from now.

Wouldnt this benefit from having an ETL service and maybe using another RDS?
My point is that having an even more monolithic app can become problematic and more task intensive.

There is a service called Glue that we can maybe start exploring https://aws.amazon.com/glue/

I've used Glue to Athena before for large datasets. Good suggestion @enriquemanuel
This is a pretty good read too: https://aws.amazon.com/blogs/big-data/orchestrating-an-etl-process-using-aws-step-functions-for-amazon-redshift/
I'm not a big fan of growing the caseflow ruby base.

It's possible to dump all the denormalized data as json or parquet and use Spark, AWS EMR, Athena, Redshift, Kinesis Data Firehose. The list is long. We should put @enriquemanuel Redshift tuning analysis on this page too.

  1. What are your decision metrics? Your writeup suggests you want to get something up and running quickly using existing DB and infrastructure that we know works. How does ease-of-maintenance, long-term cost, ease-of-use, and other metrics rank in comparison?

  2. It sounds like the input DBs will be ETL'd in batches daily, either in their entirety or queried for the latest changes. Do you want to consider a stream processing framework (e.g., Kafka or Kinesis), where changes to the input DB will get ETL'd and update the destination DB in near real-time? (I've heard great things about stream processing but don't have personal experience with it.)

@enriquemanuel @cabeaulac There are all kinds of data stores available to us. As I tried to outline in the assumptions section:

There are two related but separate issues to consider:

  • how to transform the data?
  • where to load the transformed data?

This tech spec is trying to focus on the mechanism of transforming the data as a separate question.

Enrique asks:

Wouldnt this benefit from having an ETL service and maybe using another RDS?

That is exactly what this tech spec recommends. The only question is: where should the ETL service live? Should it be a separate repo? If it's in a separate repo, how will it manage its schema, migrations, tests? How will it keep in sync with the Caseflow schema and business logic? How will it manage environments, configuration? How will it, for example, compute the status of an Appeal during the Transform step?

I am proposing we add the transformation code to the existing Caseflow repo because it makes it easier to answer all the questions above.

What are your decision metrics? Your writeup suggests you want to get something up and running quickly using existing DB and infrastructure that we know works. How does ease-of-maintenance, long-term cost, ease-of-use, and other metrics rank in comparison?

I am in favor of high ease of maintenance, low long-term cost, high ease-of-use. I think we hit all those things by staying consistent with the existing code base and leveraging our existing technologies. That makes context switching easier for engineers, and helps decrease the chance that a change to our production code will run afoul of the ETL process.

It sounds like the input DBs will be ETL'd in batches daily, either in their entirety or queried for the latest changes. Do you want to consider a stream processing framework (e.g., Kafka or Kinesis), where changes to the input DB will get ETL'd and update the destination DB in near real-time? (I've heard great things about stream processing but don't have personal experience with it.)

I am not sure there is a clear business need for a streaming pattern. This is modest sized data that changes slowly (entirely through a known, finite number of human users).

We can keep the input DBs in sync with ETL target via batch or SQS queues. By keeping the ETL code close the production code, we have the option of either.

Minor thing, the spec starts with the assumption of approximately 60k new appeal records annually, which I believe is a slight under count. An old legacy model VBA/the Board had circa 2017 assumed 160,000 new NODs a year with about half, 80k, becoming appeals that got certified to the Board. This was based on historical inputs.

As such, I think assuming in the range of 80k to 160k new records a year is more accurate. This is still such a small number of records that I do not believe it impacts any of the proposed solutions in this spec.

Just some questions and thoughts. I need to read the resources that are cited in this more, and I am not up to speed with all the cloud solutions we have available. I just wanted to summarize my understanding and clarify.

  • We will create a separate data store with a new reporting friendly schema and it will be written real time by the RAILS code.
  • Developers will have to implement logic to write to both databases , but we will find this easier since the developer is more intimate with the code and changes at the time of development, rather than explain it to a downstream ETL process that may be maintained by someone else. We can also utilize our automated testing framework to test etc.
    Question: I understand our reports are not performing in redshift and we believe its a mismatch of our data organization vs ideal use of redshift, hence this other data store. This new DB will change the data structure that is more favorable for reporting. My question is that with such a small volume of data, why would the performance be so bad? (I know you're done a lot of analysis, but I wanted to bring it up again). I thought these types of optimized structures were for when we reached the millions/billions of records. The ETL makes sense when we need to pull in more data from other sources and have a place where all the data is in place for easy access to reports. Just some thought to consider.

@pshahVA your summary is correct.

My question is that with such a small volume of data, why would the performance be so bad?

The primary answer is that the number of rows is modest, but the number of tables is large, because the data in Redshift is still normalized. JOINs in Redshift are slow, but normalized data means lots of JOINs in order to generate a report.

This new ETL infra will support denormalized data (fewer tables, with many more columns, data being "duplicated" across rows for performance reasons).

Thank you Peter,
This is probably a very naive question, but could we migrate to non redshift and maintain the same schema? Would that help with performance?

could we migrate to non redshift and maintain the same schema? Would that help with performance?

Yes. That is what I hope we do. RDS (PostgreSQL) and Redshift could have identical schemas, but RDS is 1/25th the cost and should perform just as well given the size of our data.

@pshahVA I have updated the summary to make this RDS recommendation more explicit.

I think doing the transformation step inside Caseflow makes a lot of sense for the reasons you mentioned.

A thought: does the data warehouse have to be postgres? If, like your post says, we're essentially creating a document object, could we use a DB more suited to that structure like Mongo? Then we wouldn't have to concern ourselves with migrations, and we'd handle validations only on the application layer during the transformation step. This would allow us to create an external service that retrieves transformed data through a REST interface.

does the data warehouse have to be postgres

No. It could be anything Tableau supports (including MongoDB). One of the reasons I've tried to delineate the transform from the storage is to break the link between them, mentally, and give us more flexibility.

I suggested RDS/PostgreSQL to start with only because it minimizes the tech stack overhead at this stage in development. It's a known product we already use and should work Well Enough. And it's familiar to the Tableau team.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

lowellrex picture lowellrex  路  5Comments

laurjpeterson picture laurjpeterson  路  4Comments

laurjpeterson picture laurjpeterson  路  4Comments

amprokop picture amprokop  路  6Comments

araposo-tistatech picture araposo-tistatech  路  5Comments