Vector: Better dev/test flow management

Created on 19 May 2020 · 9 comments · Source: timberio/vector

Currently, we have a lot of places where we define our tasks for various purposes:

  • Makefile
  • docker-compose.yml
  • scripts/*
  • .github/workflows/*

Given that all of these interleave in non-trivial ways and call each other, the level of complexity is close to insane, which is not good for maintenance.

I propose we reapproach this by introducing custom tooling to allow us to

  • declaratively describe the tasks in a single location,
  • allow us to invoke them with flexibility in the execution environment (i.e. directly on the host, within Docker, or in a VM),
  • properly manage dependent services lifecycle (for both on-the-host execution and in-docker execution)
  • and generate the things we can't get rid of (i.e. .github/workflows/*) from that single source of truth.
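To make the "generate from a single source of truth" idea concrete, here is a hedged sketch, with the task-list format, file names, and image name all invented for illustration (this is not Vector's actual layout): a tiny generator that emits the boring parts of a workflow file from a one-line-per-task declarative list.

```shell
#!/usr/bin/env sh
# Hypothetical sketch: generate a GitHub Actions workflow from a tiny
# declarative task list. The "name:env" format and the image name are
# invented for illustration, not Vector's actual configuration.
set -eu

# Input format, one task per line:  <task-name>:<host|docker>
generate_workflow() {
    printf 'name: ci\non: [push]\njobs:\n'
    while IFS=: read -r task env; do
        printf '  %s:\n    runs-on: ubuntu-latest\n' "$task"
        if [ "$env" = docker ]; then
            # In-docker tasks get a container; host tasks run directly.
            printf '    container: hypothetical/builder:latest\n'
        fi
        printf '    steps:\n      - run: make %s\n' "$task"
    done
}

# Example: one task on the host, one inside a container.
printf 'check-fmt:host\ntest-integration:docker\n' | generate_workflow
```

The same task list could feed other generators (a Makefile, a docker-compose.yml), which is the point: the workflow files become build artifacts rather than hand-maintained sources.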

Actually, it does not necessarily have to be custom tooling, but we definitely don't want to adopt tools that a) require extra steps to fit our needs as-is, like docker-compose, or b) aren't really designed for the job, since they'll introduce way too much complexity.

Speaking specifically about docker-compose: if we were to use it, at the very least we should generate the docker-compose.yml config from the declarative configuration. I used to have a CI/dev flow based on this setup, and it worked flawlessly, but basing the design around docker-compose was a serious limitation in the long run, and eventually limited our ability to adopt new workflows. This is why I'd explicitly recommend custom tooling and a custom task description format, or at least something better suited for the job than docker-compose. Maybe Bazel or something similar.
Maybe a custom bash-based tool would work. Custom is better since we'd have more control over how we define things and how the system executes them. Composition is the key thing to think about here. The simple stack of scripts we currently have doesn't work: it doesn't scale, and we don't have a declarative configuration. It's possible to build a proper solution on top of bash, though I'd consider using a different language (maybe even Rust).
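As a minimal sketch of what composition could look like in plain shell (task names and the dependency table are invented for illustration; a real tool would execute on the host or inside Docker at the marked line), each task declares its dependencies and the runner resolves them recursively, with simple memoization and no cycle detection:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a composable, declarative-ish shell task runner.
# Task names and the dependency table are invented for illustration.
set -eu

# Declarative part: map each task to its space-separated dependencies.
deps_of() {
    case $1 in
        build)   echo "" ;;
        test)    echo "build" ;;
        package) echo "build test" ;;
    esac
}

DONE=""

run_task() {
    local task dep
    task=$1
    # Simple memoization: skip tasks that already ran.
    case " $DONE " in *" $task "*) return ;; esac
    for dep in $(deps_of "$task"); do
        run_task "$dep"
    done
    echo "running: $task"   # placeholder for the real execution step
    DONE="$DONE $task"
}

run_task package
```

Running `run_task package` executes build, then test, then package, each exactly once; swapping the `echo` for a host or in-docker execution strategy is where the "flexibility in the execution environment" would plug in.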

The discussion in this issue may contribute to the RFC on our CI process.

Labels: rfc, task, tech debt


All 9 comments

Yeah! This level of complexity is awful.

I was pondering a bit how to do this! We have some particular notes we should be mindful of here.

  • We should support all 4 major OSes from the same build/test system.
  • We may wish to make compromises on the Windows testing, since some services are a chore there.
  • We should not place undue burden on users to have specific tooling. We should do our best to use common/normal tools.
  • We should allow users to test against real or existing services, and provide these containers/mocks only when required.

We're also constrained a bit by the matter of the Ruby toolchain required for the Documentation/Website generation scripts.

So there are many reasons to choose a Docker (or Docker-like) system for running integration tests and building the website.

I really don't think a project of a handful of people has the space to maintain a custom tool we build ourselves.

I'd generally suggest something less radical:

  • Remove docker-compose entirely; call Docker directly from the Makefile, only for ruby/integration/md tests.

    • All Rust jobs happen on host.

  • Remove the scripts folder entirely
  • Single Makefile
  • Then, make the tests actually call Docker to spawn the containers themselves when the integration-containers feature is on.
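A hedged dry-run sketch of the last point, spawning a service container only when integration testing is requested: the `INTEGRATION` variable and the image name are invented for illustration, and the docker command is echoed rather than executed so the sketch stays runnable without a Docker daemon (the `integration-containers` feature name is from the suggestion above).

```shell
#!/usr/bin/env sh
# Hypothetical sketch: spawn dependency containers only when integration
# tests are requested. INTEGRATION and the image name are invented here;
# DOCKER defaults to "echo docker", making this a dry run. The cargo
# commands are also echoed rather than executed in this sketch.
set -eu

DOCKER=${DOCKER:-echo docker}

run_tests() {
    if [ "${INTEGRATION:-0}" = 1 ]; then
        # Spawn the service the integration tests talk to.
        $DOCKER run -d --rm --name test-kafka hypothetical/kafka:latest
        echo "cargo test --features integration-containers"
    else
        echo "cargo test"
    fi
}

INTEGRATION=1 run_tests
```

The nice property is that plain `cargo test` stays untouched on the host, and the Docker dependency only appears behind the explicit opt-in.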

I'm a fan of this. Given that we've gone around in circles with this, it would be prudent to put together a _simple_ RFC. This way we can build consensus and have a document to refer to as the basis for our decisions.

Been playing with this on the nix-env-test branch while I see if I can join the checkers and have some fairly clear ideas. Will write some today!

Found this, a GHA caching setup example:

- name: Cache cargo registry
  uses: actions/cache@v1
  with:
    path: ~/.cargo/registry
    key: ${{ runner.os }}-cargo-registry-${{ hashFiles('**/Cargo.lock') }}
- name: Cache cargo index
  uses: actions/cache@v1
  with:
    path: ~/.cargo/git
    key: ${{ runner.os }}-cargo-index-${{ hashFiles('**/Cargo.lock') }}
- name: Cache cargo build
  uses: actions/cache@v1
  with:
    path: target
    key: ${{ runner.os }}-cargo-build-target-${{ hashFiles('**/Cargo.lock') }}

We might want to use this to further speed up CI. With Docker images pulled rather than built at each job, there are only a few things left to fix.
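The keys above hash Cargo.lock, so each cache is invalidated exactly when the dependency set changes. A rough shell analogue of that key scheme (sha256sum standing in for GHA's hashFiles, and the `linux-cargo-registry-` prefix standing in for `${{ runner.os }}-cargo-registry-`):

```shell
#!/usr/bin/env sh
# Rough shell analogue of the actions/cache key scheme above: the key
# changes iff Cargo.lock changes. "linux-cargo-registry-" stands in for
# the "${{ runner.os }}-cargo-registry-" prefix; the example lockfile
# content is invented for illustration.
set -eu

cache_key() {
    printf 'linux-cargo-registry-%s\n' "$(sha256sum "$1" | cut -d' ' -f1)"
}

printf 'serde = "1.0"\n' > /tmp/Cargo.lock.example
cache_key /tmp/Cargo.lock.example
```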

GitHub Actions has a small cache size limit, just 5 GB, so caches will probably be evicted too often. Also, if I remember correctly, restoring a cache can sometimes be slower than a fresh installation (macOS? :thinking:).
https://github.com/actions/cache#cache-limits

Well, yeah, we'd be caching tens of gigabytes for every build, so actions/cache isn't going to work for us.

@MOZGIII From my experiments, the gains of storing/pulling a big cache like our target directory vs building it in debug mode are minimal. I did find caching the index/registry worthwhile, for sure!

https://github.com/EmbarkStudios/cargo-fetcher - might be useful for caching the Rust parts of the build

This is basically an epic / catchall issue. Can we consider it covered by #3126 and #2971 and close this up?

