CI is in a sad state: flaky jobs mean a lot of builds get rerun, which lengthens the queues, and it is hard to make solid releases when CI cannot be trusted. This is an issue to collect the different CI problems:
1376.504621 DelimitedFiles ──────── 7.522365
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated
94.756777 lock.jl
994.803373 threads.jl
997.187098 weak
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated
Example logs: https://travis-ci.org/JuliaLang/julia/jobs/383489913, https://travis-ci.org/JuliaLang/julia/jobs/383452631
970.597421 ccache g++ -m64 -pipe -fPIC -fno-rtti -pedantic -D_FILE_OFFSET_BITS=64 -O0 -ggdb2 -DJL_DEBUG_BUILD -fstack-protector-all -I/home/travis/build/JuliaLang/julia/src -I/home/travis/build/JuliaLang/julia/src -I/home/travis/build/JuliaLang/julia/src/support -I/home/travis/build/JuliaLan
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
Might be the same issue as the previous one. It is odd that it freezes in the middle of writing a word...
Example log: https://travis-ci.org/JuliaLang/julia/jobs/383458103
Happens frequently.
Top 15 test groups by time spent:

| Test group (worker)              | Time (s) |
| -------------------------------- | -------: |
| LinearAlgebra/triangular (18)    | 478.70 |
| subarray (3)                     | 267.89 |
| loading (10)                     | 234.47 |
| Distributed (1)                  | 225.84 |
| SparseArrays/sparsevector (17)   | 220.87 |
| cmdlineargs (14)                 | 188.56 |
| bitarray (8)                     | 181.23 |
| SparseArrays/sparse (14)         | 179.24 |
| LinearAlgebra/dense (19)         | 163.49 |
| SparseArrays/higherorderfns (16) | 133.00 |
| LinearAlgebra/diagonal (21)      | 120.63 |
| arrayops (5)                     | 117.70 |
| LinearAlgebra/lu (24)            | 116.70 |
| LinearAlgebra/qr (19)            | 107.32 |
| LinearAlgebra/cholesky (23)      | 103.69 |
Time to build sysimg:
Non-debug:
Sysimage built. Summary:
Total ─────── 329.318508 seconds
Base: ─────── 82.369702 seconds 25.0122%
Stdlibs: ──── 207.425770 seconds 62.9864%
Precompile: ─ 39.519371 seconds 12.0003%
Debug:
Total ─────── 640.483594 seconds
Base: ─────── 145.372932 seconds 22.6974%
Stdlibs: ──── 406.893176 seconds 63.5291%
Precompile: ─ 88.214408 seconds 13.7731%
Example log: https://ci.appveyor.com/project/JuliaLang/julia/build/1.0.27099/job/fevqdpy21ka8btux
E: Unable to locate package g++-4.8-multilib
E: Couldn't find any package by glob 'g++-4.8-multilib'
E: Couldn't find any package by regex 'g++-4.8-multilib'
E: Unable to locate package gfortran-4.8-multilib
E: Couldn't find any package by glob 'gfortran-4.8-multilib'
E: Couldn't find any package by regex 'gfortran-4.8-multilib'
Exited with code 100
~Example log: https://circleci.com/gh/JuliaLang/julia/25927?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link~
~Perhaps fixed by https://github.com/JuliaLang/julia/pull/27257.~
Seems fairly solid.
On AppVeyor, it takes roughly 35 minutes to build the sysimg, and the rest of the time goes to running tests. Turning off the debug build might shave off 10 minutes. We could probably run a smaller test suite. Sharding across multiple AppVeyor jobs would probably be more complex and would slow the overall queue further.
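As a minimal sketch (the particular test names below are just illustrative picks from the timing table above, not a proposal for the actual subset), a smaller suite can be requested directly from the stock test driver:

```julia
# Run only an explicit subset of test groups instead of the full suite.
# Any group name accepted by test/choosetests.jl works; from a source
# checkout the equivalent is `julia test/runtests.jl core arrayops ...`.
Base.runtests(["core", "arrayops", "bitarray", "subarray"])
```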
For the record, FreeBSD CI ran into some issues like #23143, plus random freezes during kernel stress testing.
Do we really start 32 workers on appveyor in tests? Might it be oversubscribing things, or is that by design?
The Travis change happened suddenly and without any corresponding change on our end. Did they maybe water down their VMs again? That has happened several times before, with a similar effect each time.
> Do we really start 32 workers on appveyor in tests? Might it be oversubscribing things, or is that by design?
Not at once. When a worker's RSS memory grows too large, it exits and a new worker is started. 32 is just the total number of workers started over the course of the whole test run.
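For illustration only (this is not the actual code in test/runtests.jl, and the 3 GiB limit is an assumed number), the recycling behaviour amounts to something like:

```julia
using Distributed

const MAX_RSS_BYTES = 3 * 2^30  # assumed limit; the real harness has its own threshold

# Run each test file on a worker, replacing the worker whenever its resident
# set size exceeds MAX_RSS_BYTES. Every replacement bumps the total count of
# workers started, which is how a run can report 32 workers without ever
# having that many alive at once.
function run_with_recycling(testfiles)
    started = 1
    p = addprocs(1)[1]
    for file in testfiles
        remotecall_fetch(Base.include, p, Main, abspath("test", file))
        if remotecall_fetch(Sys.maxrss, p) > MAX_RSS_BYTES
            rmprocs(p)
            p = addprocs(1)[1]
            started += 1
        end
    end
    rmprocs(p)
    return started
end
```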
@iblis17 I am just curious what it takes to reproduce your setup on a Linux box? Also, perhaps on a Windows box?
I merged #27257. Hopefully that should get CircleCI back in business.
I think we can turn off the debug builds. Or, we could just build libjulia-debug to make sure the debug build works, but not build the system image in debug mode (since it takes a while and is not really different from the release build).
Trying to do the simplest thing in https://github.com/JuliaLang/julia/pull/27263 for Appveyor. This will disable the full debug build. Let's see if it helps.
Perhaps they are throttling us on travis so that we migrate to the new thing?
https://blog.travis-ci.com/2018-05-02-open-source-projects-on-travis-ci-com-with-github-apps
@ViralBShah It's just a normal BuildBot setup. I don't think my setup is different from https://build.julialang.org/.
My only effort goes into daily maintenance. First, I check for zombie/frozen processes and kill them manually (to release memory); I am not sure why some processes cannot be killed by BuildBot.
Second, I browse the build history and re-run any false-negative builds (e.g. frozen ones) that I find.
I do this work almost every day.
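If that manual cleanup were ever scripted, it could look roughly like the sketch below (the 4-hour cutoff and the `ps` flags are assumptions on my part, not part of the actual BuildBot setup):

```julia
const CUTOFF_SECONDS = 4 * 3600  # assumed threshold for calling a process "frozen"

# Find julia processes that have been running longer than the cutoff and kill
# them to release memory. Assumes a ps that supports the `etimes` keyword
# (elapsed time in seconds), as procps on Linux does.
function kill_stale_julia_processes()
    for line in eachline(`ps -axo pid=,etimes=,comm=`)
        fields = split(line)
        length(fields) < 3 && continue
        pid     = fields[1]
        elapsed = parse(Int, fields[2])
        command = join(fields[3:end], ' ')
        if occursin("julia", command) && elapsed > CUTOFF_SECONDS
            @info "Killing stale process" pid command elapsed
            run(`kill -9 $pid`)
        end
    end
end
```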
At least for Linux, setting up a BuildBot would be fairly trivial (and we likely have the capacity). Mac is probably more challenging.
Appveyor is now on increased capacity of 10 concurrent workers, with time allocation of 3 hours.
Great, only Travis left to figure out then!
They wrote back saying that they can only help us early next week (which may or may not even be Monday).
Also, the debug build is 20 minutes. Is that really worthwhile to build?
Looking at the history of AV, the last successful run was on 23da960088d9e9d48bafff45db3a55ec6b86795f, 6 days ago, and it took about 2 hours. Many builds were canceled just after that to free up CI time for the alpha release, but every build that ran and didn't fail for other reasons timed out after 3 hours. What happened there?
I think that is https://github.com/JuliaLang/julia/issues/27274
Especially https://github.com/JuliaLang/julia/issues/27274#issuecomment-392300404, yes. Thanks for the pointer.