We have a number of flaky tests that are imposing a heavy cost on our dev and CI process. We should have a strategy to eliminate them. I propose the following:
The tests currently provide negative value, so deleting them gets us back to a clean state as quickly as possible. We can then add them back (written in a more reliable fashion) as deemed valuable. If they're not valuable enough to rewrite, that's fine too.
@lukesteensen @Hoverbear @fanatid @bruceg @ktff, could you list any tests that you are aware of and we can get to work on this? I'd like to compile a concrete list we can work against.
encodes_histogram_without_timestamp
encodes_histogram_without_timestamp.log
sources::journald::tests::filter_unit_works_correctly
https://github.com/timberio/vector/runs/855687407
@Hoverbear where did you see encodes_histogram_without_timestamp failing? It should only be failing in #2913 like that, which isn't merged.
topology::config::watcher::tests::multi_file_update on mac
https://github.com/timberio/vector/pull/3010/checks?check_run_id=857600333
@ktff You're right! Sorry!
https://github.com/timberio/vector/issues/2978#issuecomment-656620358 is #3000
topology::reload_tests::topology_reuse_old_port on windows
https://github.com/timberio/vector/pull/3010/checks?check_run_id=857742471
https://github.com/timberio/vector/pull/3017#issuecomment-656849980
test_max_size_resume
test_max_size_resume.log
I've seen this one before I think
@Hoverbear not surprising, they were addressed in #2862, but the race to the file mentioned in the comment is quite persistent. (EDIT: It's something else)
tests\tcp::merge on windows
https://github.com/timberio/vector/pull/3010/checks?check_run_id=861595079
test_udp_syslog
https://github.com/timberio/vector/pull/2955/checks?check_run_id=861560287
The PR changes syslog but not the udp part, and the test passes locally.
sinks::aws_s3::integration_tests::s3_waits_for_full_batch_or_timeout_before_sending
https://github.com/timberio/vector/pull/3099/checks?check_run_id=886641917
https://github.com/timberio/vector/runs/868843252
sources::socket::test::tcp_gracefull_shutdown
https://github.com/timberio/vector/runs/874679387
Seems to be failing because of OS errors, probably firewall.
@ktff thanks for working through these. How are we feeling about closing this issue given the above? Are there tests left that we need to remove?
> Are there tests left that we need to remove?
Based on recent CI runs, these are the only flaky tests, so no.
It should be fine to close it. If another one pops up, it can be dealt with in the regular way.
Sounds good 👍
As per https://github.com/timberio/vector/pull/3416#issuecomment-672414018
test_reclaim_disk_space
Re-opening, so that we can keep using this issue to track all flaky tests. We'll still create separate issues for each failing test, but link to this one so that you can easily cmd+F this issue to see if a test is already reported as being flaky (the GH search function doesn't always work as well as you want it to, unfortunately).
Closing this since the issue is not particularly helpful anymore. Please continue to open individual issues for each test removed.