Elixir: Compiler hang

Created on 27 Apr 2020  路  30Comments  路  Source: elixir-lang/elixir

Environment

  • Elixir & Erlang/OTP versions (elixir --version):
Erlang/OTP 22 [erts-10.7.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe] [dtrace]
Elixir 1.10.3 (compiled with Erlang/OTP 22)
  • Operating system: macOS Catalina

Behavior

While doing a clean build of an app (via mix test) after upgrading to Elixir 1.10.3 (from 1.10.2), I hit a hang in the compiler. It printed the "taking longer than 15s..." warning for many files, then stopped making any progress.

After 10 ~ 15 min I took a crash dump using SIGUSR1. At that point, the only process with messages in its queue was code_server (MsgQ=7). The code server is inside do_purge, waiting on a reply from erts_code_purge, which is itself inside a do_purge function, purging a module called elixir_compiler_18.

image

The request looks like it came from another process that was compiling some
LiveComponent part of the app:

image

I can't share the crash dump, but is there something I could check that might help?

This didn't reproduce after I killed the compiler and tried again.

Elixir (compiler) Bug

Most helpful comment

Nasty fix!

image

And this, ladies and gentlemen, is the reason we ALL love programming right?

All 30 comments

I've also noticed similar hangs and slowdowns lately. Those happen both on local MacOS and CI (Debian Linux) and started happening before 10.3 was released.

Unfortunately I don't have anything that can help in pinpointing the issue but CI logs like

==> ecto
Compiling 55 files (.ex)
Compiling lib/ecto/repo/queryable.ex (it's taking more than 15s)
Compiling lib/ecto/query/inspect.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/filter.ex (it's taking more than 15s)
Compiling lib/ecto/repo/preloader.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/windows.ex (it's taking more than 15s)
Compiling lib/ecto/schema.ex (it's taking more than 15s)
Compiling lib/ecto/repo/schema.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/preload.ex (it's taking more than 15s)
Compiling lib/ecto/embedded.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/order_by.ex (it's taking more than 15s)
Compiling lib/ecto/type.ex (it's taking more than 15s)
Compiling lib/ecto/uuid.ex (it's taking more than 15s)
Compiling lib/ecto/association.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/join.ex (it's taking more than 15s)
Compiling lib/ecto/log_entry.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/cte.ex (it's taking more than 15s)
Compiling lib/ecto/query/planner.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/select.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/distinct.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/limit_offset.ex (it's taking more than 15s)
Compiling lib/ecto/changeset.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/group_by.ex (it's taking more than 15s)
Compiling lib/ecto/changeset/relation.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/dynamic.ex (it's taking more than 15s)
Compiling lib/ecto/multi.ex (it's taking more than 15s)
Compiling lib/ecto.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/update.ex (it's taking more than 15s)

Too long with no output (exceeded 10m0s): context deadline exceeded

Thanks @dmorneau. We have recently heard similar reports. Can you check if there is any process waiting on the code_server/code:purge call? If so, what is the callsite?

@lukaszsamson, do you remember if this happened with v1.9? Or did it start with v1.10? Tracking this down would help us a lot.

@josevalim IIRC it started happening 2-3 weeks ago and since over a month all of our build pipelines run on 1.10. The first CI failure from our servers is from 19 days ago. If I had to guess I'd blame it on otp 22.3.2 released 21 days ago.

@lukaszsamson can you please use OTP 22.2 on CI and see if it addresses the issue?

If your CI doesn't allow you to choose a particular OTP version, but you can use a custom Docker image, you can try one of these: https://hub.docker.com/r/hexpm/elixir/tags?name=1.10.3-erlang-22.2.8&page=1

Can you check if there is any process waiting on the code_server/code:purge call? If so, what is the callsite?

@josevalim Yes, it's the second screenshot I pasted. It looks like this is the call:

https://github.com/elixir-lang/elixir/blob/v1.10.3/lib/elixir/src/elixir_compiler.erl#L77
https://github.com/erlang/otp/blob/OTP-22.3.2/lib/kernel/src/code.erl#L199

Thanks! It seems there was a bugfix in the code purger on 22.3.0 to 22.3.1, so I would recommend those running into it to revert to 22.2 and see if the issue persists.

@lukaszsamson can you please use OTP 22.2 on CI and see if it addresses the issue?

We can but it's not going to be a drop in replacement as we currently use circleci/elixir images which do not tag otp releases

We can but it's not going to be a drop in replacement as we currently use circleci/elixir images which do not tag otp releases

As @wojtekmach said you can use these docker images which do tag the OTP version: https://hub.docker.com/r/hexpm/elixir/tags?name=1.10.3-erlang-22.2.8&page=1

On our side we have seen the problem happening when we switched from 22.3 to 22.3.2. Currently trying to downgrade to see if it will fix it.

We use CircleCI images as well and we're unable to use the Hex images due to them lacking git (and possibly other things)

If anyone runs into this issue, please do:

  1. If you are on Linux, do kill -ABRT <pid of beam.smp>
  2. If you are on elsewhere, press Ctrl+C, then in the break menu type "A" (uppercase) and press enter

I have opened up an issue with Erlang/OTP here. @dmorneau, if it is an option, please consider sending the crash dump privately to them.

Just upgraded to Erlang 22.3.3, the hang happened still (and on Elixir 1.10.3), and finally got an erl_crash.dump file.

Will I be able to upload it to the Erlang issue tracker if I sign up there?

@dimitarvp yes, you should. If not, feel free to email me (email is on my profile), and I can make sure it is delivered.

@josevalim I was able to get a dump on macos with

lldb --attach-pid <PID>
process save-core "filename"

will that be helpful? In the OTP bug only linux dumps were mentioned.

@lukaszsamson send it there just in case it can help someone. :)

OTP 22.3.4 has been released with some fixes that may (or may not) affect this. If folks can upgrade to latest 22.3.4 and see if the issues persist or not, it would be very welcome. Thank you!

I have made a first test: on my computer (Mac), I don't remember strictly seeing the slow-down/hang, but currently the compilation seems fast with OTP 22.3.4. I'll be able to make a second test once 22.3.4 is available through https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb (currently it is apparently not), because this is what I use in Docker and where the slow compilation occurred for me. I will try again later!

As mentioned in the linked Erlang ticket: I checked with a big-ish project of mine and I have seen no hangs or slowdowns during compilation with 22.3.4.

Didn't want to be the bringer of evil news but I still experience hangs on CI with 22.3.4 and elixir 1.10.3 on a large project. It is kinda random when it triggers and it is difficult to re-run with ssh the build job so that I could get any dumps.

Same here, unfortunately - still seeing compiler timeouts after 10 minutes in CI occasionally. Here's the info for our build that just timed out:

Build-agent version 1.0.31788-6ad45373 (2020-05-07T13:35:11+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 370ffda9bb7e 4.15.0-1052-aws #54-Ubuntu SMP Tue Oct 1 15:43:26 UTC 2019 x86_64 Linux
Starting container hexpm/elixir:1.10.3-erlang-22.3.4-alpine-3.11.3
  image cache not found on this host, downloading hexpm/elixir:1.10.3-erlang-22.3.4-alpine-3.11.3
1.10.3-erlang-22.3.4-alpine-3.11.3: Pulling from hexpm/elixir
c9b1b535fdd9: Already exists
cee2ac6549f7: Pulling fs layer
b5216944cd8d: Pulling fs layer
ecbca7291e76: Pulling fs layer
cee2ac6549f7: Verifying Checksum
cee2ac6549f7: Download complete
ecbca7291e76: Verifying Checksum
ecbca7291e76: Download complete
cee2ac6549f7: Pull complete
b5216944cd8d: Verifying Checksum
b5216944cd8d: Download complete
b5216944cd8d: Pull complete
ecbca7291e76: Pull complete
Digest: sha256:f815d0b1a551954023535184f75bbf2c8b90f81afbcb0baacbecf0ac3d502a7a
Status: Downloaded newer image for hexpm/elixir:1.10.3-erlang-22.3.4-alpine-3.11.3
  using image hexpm/elixir@sha256:f815d0b1a551954023535184f75bbf2c8b90f81afbcb0baacbecf0ac3d502a7a

It appears to be random for us as well, as I can retry the build on the same git SHA and get a build that compiles just fine.

We are also seeing this behavior with the same random behavior from the same SHAs. We're on 1.10.2, as a data point. I had to cancel compile steps that had been running for 1.5h. Rerunning with same inputs succeeded in 1-2m. Please let me know if I can provide data to help diagnose.

As others have reported 22.3.4 does not fix the issue. I captured another dump and attached to the erlang bug report.

The OTP team has provided a modified version of Erlang's source: https://bugs.erlang.org/browse/ERL-1236?focusedCommentId=17936&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17936

If someone could run it on Debian, it would be very much appreciated. You can use something like kerl to compile such version locally. Please ask if you have any questions.

Just FYI, I'm also seeing this issue on Ubuntu (compiles that hang). Env invo:

OS: Ubuntu 18.04.4 LTS x86_64 
CPU: Intel Xeon E5-2676 v3 (1) @ 2.399GHz 
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:1] [hipe]
Elixir 1.10.3 (compiled with Erlang/OTP 22)

Seems to have started appearing after I updated Erlang recently. Also, not sure if this can result in problems such as this but Elixir seems to be compiled against OTP22 while I now have OTP23 installed.

As others have stated, the behavior seems nondeterministic (ugh). Killing the process and rerunning the same command again has succeeded most of the time so far.

If I can help at all with additional info or diagnostics, please let me know!

This should be fixed on Erlang 22.3.4.1 released today. A fix for 23.0 is coming out soon. Thanks everyone for helping!

Nasty fix!

image

And this, ladies and gentlemen, is the reason we ALL love programming right?

Hello,

We are trying to switch from Erlang OTP v22.3 to OTP v23.0.3 in our project.

_@ubuntu:/mnt/src/agilis_fw/common_core.fw.x86# elixir --version
Erlang/OTP 23 [erts-11.0.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

Elixir 1.10.3 (compiled with Erlang/OTP 23)_

We encounter the hand followed by the erl_crash.dump on each and every module doing the:

_import OurApp.Gettext_

I deem it might be related to the issue above.

I can share the crashdump:

erl_crash.dump.gz.part-ab.gz
erl_crash.dump.gz.part-ac.gz
erl_crash.dump.gz.part-aa.gz

The ^^ is a one gzip archive split with the split -b 9m command to circumvent the github's restrictions.

With my best wishes
Tomasz Motyl

This should be fixed on Erlang 22.3.4.1 released today. A fix for 23.0 is coming out soon. Thanks everyone for helping!

We switched to Erlang 22.3.4.1. We still observe Erlang compiler hangs about 5-10% of our runs in CI - we have almost 800K of Erlang in our project spread over 8 files. (ASN.1 codecs generated by asn1ct.) It seems that the compiler hangs until the job is killed about 40 minutes later.

Opened this: https://bugs.erlang.org/browse/ERL-1433

Was this page helpful?
0 / 5 - 0 ratings

Related issues

andrewcottage picture andrewcottage  路  3Comments

josevalim picture josevalim  路  3Comments

DEvil0000 picture DEvil0000  路  3Comments

ericmj picture ericmj  路  3Comments

ckampfe picture ckampfe  路  3Comments