Vector: Get rid of flaky tests

Created on 8 Jul 2020 · 20 Comments · Source: timberio/vector

We have a number of flaky tests that are imposing a heavy cost on our dev and CI process. We should have a strategy to eliminate them. I propose the following:

  1. Identify the tests in question
  2. For each test, open an issue describing its purpose and then delete the test entirely, linking the deletion commit in the issue
  3. Triage the issues by the cost/benefit of having each test, based on its described purpose
  4. Rewrite tests that are worthwhile

The tests currently provide negative value, so deleting them gets us back to a clean state as quickly as possible. We can then add them back (written in a more reliable fashion) as deemed valuable. If they're not valuable enough to rewrite, that's fine too.
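As a rough illustration of step 1, here is a minimal sketch of how flaky tests could be picked out of CI history. It assumes you already have per-test pass/fail records from repeated runs of the same commit; the `Outcome` type and the `flaky_tests` function are hypothetical names for this example, not part of Vector or its CI tooling. A test counts as flaky here if it has both passed and failed.

```rust
use std::collections::BTreeMap;

/// Outcome of one test in one CI run (hypothetical record type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Outcome {
    Passed,
    Failed,
}

/// Returns the names of tests that both passed and failed across the given
/// records, sorted by name. Assumes all records come from the same commit,
/// so a mixed result points at the test rather than at a code change.
pub fn flaky_tests(results: &[(&str, Outcome)]) -> Vec<String> {
    // Map each test name to (pass count, fail count).
    let mut tally: BTreeMap<&str, (u32, u32)> = BTreeMap::new();
    for &(name, outcome) in results {
        let entry = tally.entry(name).or_insert((0, 0));
        match outcome {
            Outcome::Passed => entry.0 += 1,
            Outcome::Failed => entry.1 += 1,
        }
    }
    // Keep only tests with at least one pass and at least one failure.
    tally
        .into_iter()
        .filter(|&(_, (passed, failed))| passed > 0 && failed > 0)
        .map(|(name, _)| name.to_string())
        .collect()
}
```

A test that fails on every run is left out on purpose: it is consistently broken rather than flaky, and would be handled as an ordinary bug.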

tests bug tech debt

All 20 comments

@lukesteensen @Hoverbear @fanatid @bruceg @ktff, could you list any tests that you are aware of and we can get to work on this? I'd like to compile a concrete list we can work against.

encodes_histogram_without_timestamp
encodes_histogram_without_timestamp.log

sources::journald::tests::filter_unit_works_correctly
https://github.com/timberio/vector/runs/855687407

@Hoverbear where did you see encodes_histogram_without_timestamp failing? It should only be failing in #2913 like that, which isn't merged.

topology::config::watcher::tests::multi_file_update on mac
https://github.com/timberio/vector/pull/3010/checks?check_run_id=857600333

@ktff You're right! Sorry!

topology::reload_tests::topology_reuse_old_port on windows
https://github.com/timberio/vector/pull/3010/checks?check_run_id=857742471

https://github.com/timberio/vector/pull/3017#issuecomment-656849980

test_max_size_resume
test_max_size_resume.log

I've seen this one before I think

@Hoverbear not surprising. They were addressed in #2862, but the race to the file mentioned in the comment is quite persistent. (EDIT: It's something else)

test_udp_syslog
https://github.com/timberio/vector/pull/2955/checks?check_run_id=861560287

The PR changes syslog but not the udp part, and the test passes locally.

sinks::aws_s3::integration_tests::s3_waits_for_full_batch_or_timeout_before_sending

https://github.com/timberio/vector/pull/3099/checks?check_run_id=886641917
https://github.com/timberio/vector/runs/868843252

sources::socket::test::tcp_gracefull_shutdown

https://github.com/timberio/vector/runs/874679387

Seems to be failing because of OS errors, probably firewall.

@ktff thanks for working through these. How are we feeling about closing this issue given the above? Are there tests left that we need to remove?

Are there tests left that we need to remove?

Based on recent CI runs, these are the only flaky tests, so no.

It should be fine to close it. If another one pops up, it can be dealt with in the regular way.

Sounds good 👍

merge_and_fork

https://github.com/timberio/vector/runs/928235975

Either all messages pass, or none.

Re-opening, so that we can keep using this issue to track all flaky tests. We'll still create separate issues for each failing test, but link to this one so that you can easily cmd+F this issue to see if a test is already reported as being flaky (the GH search function doesn't always work as well as you want it to, unfortunately).

Closing this since the issue is not particularly helpful anymore. Please continue to open individual issues for each test removed.
