Caseflow: Engineering topics for discussion

Created on 26 Apr 2019  路  17Comments  路  Source: department-of-veterans-affairs/caseflow

A running agenda of things to discuss:

  • [x] FACOLS
  • [x] Guidelines for when it is okay to ignore GitHub's pre-merge status checks
  • [x] Code Climate
  • [x] Kibana space recycling automation (https://dsva.slack.com/archives/C2ZAMLK88/p1556580882067200)
  • [x] Coding standards (design patterns, naming, etc.)
  • [x] Property-based testing
  • [x] How to make transitioning to unfamiliar parts of the codebase easier
Tech-Improvement

All 17 comments

  • [x] Bat Team discussion

    • Now that we've all been through it once, what are the teams overall thoughts?

    • Tabled for a discussion post recompete

  • [x] [Managing Caseflow's multiple repositories](https://dsva.slack.com/archives/C2ZAMLK88/p1563192663481400)

    • What are the pain points?

    • Can we consolidate any of these repos to make our lives easier?

  • [x] ZenHub/GitHub practices

    • Assign yourself to your PRs

    • Attach screenshot/GIF when appropriate

    • In ZH, tickets in Ready for Dev pipeline may need estimation.

  • [x] Testing best practices

    • Things I noticed while working on speeding up the test suite that we could be doing going forward to ensure faster tests.

I'm happy to do a WW instead if the huddle is not the place for this.

  • [x] Code reviews - what are the reviewer's responsibilities?
  • [x] Obtaining dev environment DB images w/o needing access to official AWS (due to delays in obtaining access)
  • [x] Discuss performance of the application.

    • How do we know if the app is performing poorly?

    • How do we know which parts of the app are slow?

    • Adding performance tests to our automated tests (#11711)

To add on to Lowell's questions:

  • [ ] How do we ensure each new commit does not decrease performance?
  • [x] Discuss upgrades to frontend (runtime)
  • [x] Discuss possible upgrades to frontend (build tooling)
  • [x] Sentry alerts: they are not slowing down. Can we prevent them by fixing the code?
  • [x] Should we clean VACOLS and postgres between test runs by default

    • Regularly causes test failures because our human-intensive solution for identifying when we need to clean databases leaks state.

    • The runtime of our rspec builds in CircleCI has not decreased meaningfully as a result of these changes. Initial discussion expected reduction in runtime from 18 minutes to ~6.5 minutes. Runtimes on CircleCI continue to be around 17 minutes.

    • I propose rolling back PR #11470

Created ticket to do this work: https://github.com/department-of-veterans-affairs/caseflow/issues/12725

@lowellrex I saw your comment pop up in my email. What a coincidence! I recently finished a blog post about the work I did to speed up the test suite (not yet published). I spent time measuring the time savings in Circle CI by taking the average of the longest RSpec runs for 5 builds before and after my changes, and the difference was 0.65 minutes. I also looked at the average number of builds over the past week, which was around 50. The highest number of builds I counted on a particular day was 83. Using an average of 50 builds per day, and 0.65 minutes saved per build, that's 32.5 minutes saved every day, which I would consider meaningful. Over a period of 1 year, that's 3 work weeks saved.

This doesn't take into account the time saved locally when running tests. Every engineer saves 5 seconds every time they run a single test that doesn't use VACOLS. In addition, every new test that is added that doesn't use VACOLS will run faster than it would have. If this change is reverted, the test suite will get slower over time at a faster rate, which can be evidenced by the speed difference observed locally.

Some questions I would ask:

  • Exactly how often do these failures happen, that are for sure due to missing :postgres/:all_dbs tags, and/or require statements?
  • Are the failures due to existing specs or new ones?
  • What is the most common issue?

    • Specs are not tagged at all with either :postgres or :all_dbs

    • The wrong tag was used

    • Missing require statements

    • The wrong require statement

  • Who is not properly tagging specs?

    • Mostly new team members?

    • An even mix of team members?

    • Mostly seasoned members?

    • Mostly the same folks?

  • Is there a way to rewrite the specs that didn't clean the DB such that they don't need one or both DBs at all?

One suggestion I would make: when writing a new test or set of tests in a file, run them twice in a row. That is a reliable way I have found to identify DB cleaning issues. Looking at test.log also helps to see which DBs are used.

Another thing I would recommend looking into is Knapsack Pro to see how it compares to the current parallelization implementation. I was planning on trying it out, but didn't have time before I left. I hooked it up on login.gov and had a good experience with it. It's been improved since then, and they have a dynamic queue mode now that will optimize each node so that they all finish around the same time, as opposed to having one node take a minute longer than the rest, for example.

I notice some recurring themes. Personally, I've spent way too many hours babysitting CircleCI as I rerun 15-minute rspec tests to determine if failures are caused by me, a flakey test, or a new non-deterministically-failing test.

Some considerations:

  1. If developers are copy-and-pasting existing test files to create new tests, we should point them to well-written tests so as to promote performant and otherwise better tests.
  2. I notice this Testing Best Practices page. Can we automate some of these guidelines as checker or linting rules?
  3. Can we identify relevant problems with current tests (including pitfalls and desired spec file changes), present solutions to those problems to all the engineers, and make a concerted, time-boxed effort to make those changes?

@yoomlam I think it's fine to re-visit our approach to flakey tests as part of Huddle. Some things to be aware of:

  • If tests are failing on Circle CI that are not related to your PR, it's quite likely they are flakes. We track those in #10516 so that's the first place I go to check if it's a known flake or something new.
  • The Bat Team does work on flakes as time allows. See https://github.com/department-of-veterans-affairs/caseflow/wiki/Bat-Team#flakey-tests
  • If a flake is blocking your PR, mark it with skip and make sure it is logged on #10516
  • Drop a note in #appeals-engineering if you are going to work on fixing a flake, just to make sure others are aware (as someone else might also be working on a fix).

Closing this issue in favor of using a Google doc to track discussion

Was this page helpful?
0 / 5 - 0 ratings