Nim: Long-lived application occasionally dies with SIG_PIPE when calling httpclient.request

Created on 5 Dec 2018  Â·  40Comments  Â·  Source: nim-lang/Nim

Our serverside application, currently on 0.18, does http the occasional http request to the Google APIs. Sometimes, I think, google closes the connection on us and the application dies with SIG_PIPE:

...
keyprovision_google.nim(66) getAccessToken
keyprovision_google.nim(30) requestJson
keyprovision_google.nim(71) :anonymous
httpclient.nim(1196)     request
httpclient.nim(1181)     requestAux
httpclient.nim(1031)     parseResponse
httpclient.nim(883)      close
net.nim(866)             close
SIGPIPE: Pipe closed.

Of course, the http request() is wrapped into try/except; that does not catch the signal however.

This is running on nim 0.18 (I can't try 0.19 yet, porting it is significant effort due to libraries used), and in a docker container on Linux.

I found this thread: https://forum.nim-lang.org/t/4417, but that didn't really help. The httpclient is already configured to use a reasonable timeout (3s).

I'm fairly certain this is because Google is closing the kept-alive connection in the background; but I'm hesitant to add Connection: Close (or otherwise close the connection on our end) unless that's the only way to solve this.

High Priority Showstopper Stdlib

Most helpful comment

This is not a Django issue. The cause of this bug is described in https://github.com/nim-lang/Nim/issues/9867#issuecomment-473902394

With a bit effort you can simulate it by raising error on a call to socket.close(). https://github.com/nim-lang/Nim/blob/devel/lib/pure/asyncnet.nim#L721 - replace this line with raising the exception unconditionally and then call socket.close() twice. That is the essence of the problem. The second call results in corrupted memory, because SslShutdown and SSLFree are called second time on the pointers that are already freed. There is socket.closed = true, but it is not getting checked.
The actual user code may have this problem even without calling close() twice - when you use AsyncHttpClient, it closes the socket internally, and you also need to close() the AsyncHttpClient itself. So even if you have single close() in your code, and the code itself is correct, there is a corruption. And anyway, worst case should be you get an exception when you misuse an API, not a corruption.

The AsyncSocket and SSL APIs in Nim are in general very fragile and easy to misuse, creating leaks and other bugs. For example, you may call wrapSocket() on a socket that was already wrapped and previous pointers will be leaked. Thorough review of the API is necessary, with exceptional cases being documented and handled where necessary. At the moment I wouldn't suggest using these APIs in production applications - they are ridden with bugs even if you use them correctly, and you may not get a timely error when you misuse them.

All 40 comments

huh, it's SIGPIPE'ing on a close. Interesting, it appears to be OpenSSL's fault: https://github.com/nim-lang/Nim/blob/v0.18.0/lib/pure/net.nim#L866.

Anyway, please follow the advice here: https://stackoverflow.com/a/108192/492186. As a workaround you can ignore all SIGPIPEs in your program, as a proper fix I guess we need to set some OpenSSL flag or something.

Heya dom96, I tried that, like so:

from posix import signal, SIG_PIPE, SIG_IGN
signal(SIG_PIPE, SIG_IGN)

It seemed to have worked for a good while, but today it fell over with this (one exception per line):

Broken pipe Additional info: IO error has occurred in the BIO layer; will retry
not isClosed(socket) Cannot `send` on a closed socket; will retry
Broken pipe Additional info: IO error has occurred in the BIO layer; not retrying anymore

All future requests then also failed on the assert not isClosed and I had to bump the process to get functionality back.

Hrm, would be nice to get stack traces for those exceptions. I'm guessing
httpclient doesn't handle the socket getting closed correctly.

On Tue, 11 Dec 2018 at 09:12, niv notifications@github.com wrote:

Heya dom96, I tried that, like so:

from posix import signal, SIG_PIPE, SIG_IGNsignal(SIG_PIPE, SIG_IGN)

It seemed to have worked for a good while, but today it fell over with
this (one exception per line):

Broken pipe Additional info: IO error has occurred in the BIO layer; will retry
not isClosed(socket) Cannot send on a closed socket; will retry
Broken pipe Additional info: IO error has occurred in the BIO layer; not retrying anymore

All future requests then also failed on the assert not isClosed and I had
to bump the process to get functionality back.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/nim-lang/Nim/issues/9867#issuecomment-446126983, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAPDe--Q2uRyV0armpnf40XAprhuK5nDks5u33b6gaJpZM4ZCheN
.

Ah, sorry. Don't have those anymore.

I've also added workaround code now that force-closes the httpclient after reach request chain, so that the production process doesn't hang itself.

This is a double-close issue that I'm also seeing in production. It's not an OpenSSL issue, as the error sometimes is not (pkey.ident == 0) Descriptor 26 is not registered in the selector! and sometimes it's an OpenSSL crash, caused by double-free, like in the OP's case. Ignoring SIGPIPE or other such signals is not an option as the app will continue working in a corrupted state, which you do not want - better to crash early and restart.

On Socket/AsyncSocket level there is no checking for whether the socket was already closed, and it seems that under some condition httpclient.request may close the socket after a failure occurred without setting connected field to false. By glancing at the code I'm not yet sure where exactly this happens.

The code on which the error occurs looks like this:

var httpClient = newAsyncHttpClient(userAgent = ctx.userAgent)
# headers omitted
let respFuture = httpClient.request(ctx.endpoint, httpMethod = HttpPOST, body = reqBody)
yield respFuture
if respFuture.failed:
  httpClient.close() # Double close here!
  raise respFuture.readError()

A simple fix would be to add connected field to Socket/AsyncSocket objects to avoid any double-close issues. But the httpclient issue should also be fixed.

So apparently the scenario is this:

  1. Server disconnects, leaving the client's underlying OS socket in disconnected state.
  2. AsyncHttpClient calls socket.close(), where SslShutdown errors out (due to the disconnected socket) and so close() raises. This leaves AsyncHttpClient in connected state, since it doesn't handle exceptions from socket.close().
  3. close() is called again and SslShutdown/SSLFree is called for the second time on invalid pointer, corrupting the memory.

So either socket.close() should be made to never raise or else any code that calls it must handle the exceptions. If it raises, it should clean up the state as much as possible (e.g. set sslHandle to nil after freeing it and not use it if it's already nil).

SSL_shutdown can be made to always succeed by calling SSL_CTX_set_quiet_shutdown, according to the SSL_shutdown documentation.

Another important line from SSL_shutdown() doc:

Note that SSL_shutdown() must not be called if a previous fatal error has occurred on a connection i.e. if SSL_get_error() has returned SSL_ERROR_SYSCALL or SSL_ERROR_SSL.

So we should also track whether a fatal error occurred on the connection and not call SSL_shutdown in such case. If the above condition is violated, SSL_shutdown may raise even if SSL_CTX_set_quiet_shutdown() was called.

close() is called again and SslShutdown/SSLFree

This implies that the original exception raised by SslShutdown is getting caught somewhere, if so where?

It results in a failed future returned from request(). Regardless, AsyncHttpClient.close() is expected to be always safe to call, and here it gets in a state where calling it results in a crash or corrupted memory.

@dom96 Here is a stack trace for this issue with nim 1.0.4 and SIG_PIPE ignored (per niv's comments):

Got exception ref 0x7f0485be2098 --> [errorCode = 32,
parent = nil,
name = 0x5597fa08ce9a"OSError",
msg = 0x7f0485be12d8"Broken pipe\10"
"Additional info: \"IO error has occurred in the BIO layer\"",
trace = 0x7f04841c6048@[[procname = 0x5597fa0956f5"eos_watcher_rpc",
line = 239,
filename = 0x5597fa095328"/root/eos-watcher/eos_watcher_rpc.nim"], [procname = 0x5597fa095673"main",
line = 206,
filename = 0x5597fa095328"/root/eos-watcher/eos_watcher_rpc.nim"], [procname = 0x5597fa095468"get_block",
line = 174,
filename = 0x5597fa095328"/root/eos-watcher/eos_watcher_rpc.nim"], [procname = 0x5597fa09157b"request",
line = 932,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09157b"request",
line = 907,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09154a"requestAux",
line = 892,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09150a"parseResponse",
line = 770,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa091500"parseBody",
line = 678,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09140e"recvFull",
line = 608,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa0911a7"close",
line = 563,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa08f3bb"close",
line = 900,
filename = 0x5597fa08f118"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim"], [procname = 0x5597fa08f2d9"socketError",
line = 735,
filename = 0x5597fa08f118"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim"], [procname = 0x5597fa08cea2"raiseOSError",
line = 94,
filename = 0x5597fa08cd98"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/includes/oserr.nim"]],
raiseId = 0,
up = ref 0x7f0485be2098]
 with message Broken pipe
Additional info: "IO error has occurred in the BIO layer"
Sleeping for 60 seconds...
/root/eos-watcher/eos_watcher_rpc.nim(239) eos_watcher_rpc
/root/eos-watcher/eos_watcher_rpc.nim(206) main
/root/eos-watcher/eos_watcher_rpc.nim(174) get_block
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim(932) request
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim(907) request
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim(886) requestAux
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim(1349) send
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim(1334) send
/root/.choosenim/toolchains/nim-1.0.4/lib/system/assertions.nim(27) failedAssertImpl
/root/.choosenim/toolchains/nim-1.0.4/lib/system/assertions.nim(20) raiseAssert
/root/.choosenim/toolchains/nim-1.0.4/lib/system/fatal.nim(39) sysFatal
Error: unhandled exception: /root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim(1334, 9) `not socket.isClosed` Cannot `send` on a closed socket [AssertionError]
Error: execution of an external program failed: '/root/eos-watcher/eos_watcher_rpc '

This (closing httpclient) is causing my web service with thousands of users to occasionally crash, sometimes more often than others. I tried @endragor's suggestions but it didn't seem to help. Error messages like this:

*** Error in `./nitter': free(): invalid pointer: 0x0000558d1c889a00 ***
*** Error in `./nitter': double free or corruption (!prev): 0x000055b56f28ef20 ***
INFO Jester is making jokes at http://0.0.0.0:5000
Traceback (most recent call last)
/home/zed/src/dev/nim/nitter/src/nitter.nim(33) nitter
/home/zed/src/dev/nim/jester/jester.nim(508) serve
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(1879) runForever
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(1574) poll
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(1348) runOnce
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(210) processPendingCallbacks
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncmacro.nim(34) fetchHtmlNimAsyncContinue
/home/zed/src/dev/nim/nitter/src/api/utils.nim(34) fetchHtmlIter
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/httpclient.nim(580) close
/home/zed/.choosentm/toolchains/nim-#devel/lib/pure/asyncnet.nim(697) close
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

I'd give it another try after my openssl changes (#13919 ), this seems to be somewhat related so it's possible it was fixed.

That PR doesn't seem to affect this bug. The bug is not about OpenSSL, it's about Nim code on higher level - namely:

  1. AsyncHttpClient
  2. Usage of SSL_shutdown - it may error out in situations which are expected for the code that uses it, so it should be handled.

See my old comments here - I pinpointed where the problems are.

Another error message:error:140E0197:SSL routines:SSL_shutdown:shutdown while in init

I have a Django / Gunicorn / Nginx server and decided to write some speed critical parts in Nim. I created the .so files with Nimpy and after testing them thoroughly they seem to work well. However, when I use them as part of the server loop in production, I often get a SIGPIPE: Pipe closed and the server dies (ignoring any try - catch).

I was looking for information about this problem which I attributed to Python at first and ended up here. So I removed all the Nim code from the server and now it runs without issues. Therefore I guess this is the same issue you are mentioning in this thread but in this case the server is not written in Nim. In fact I just have a couple of Nim files with one (very simple) function each.

Please let me know if you want to see some specific logs / stack traces to address this issue.

Please let me know if you want to see some specific logs / stack traces to address this issue.

Oh, yes, please!

I've simplified the stack to test it in an easier way. I've removed nginx and gunicorn and run the Python app directly in Django's development server. Also I've removed all calls to nim generated .so modules from the Python code with the exception of a single function. With this arrangement the server keeps crashing. An it does so very early, either on page load or after interacting with the site once or twice.

To get some stacktraces, the first thing I tried is to wrap the Python function that calls the nim .so in try-catches with no luck. Server execution stops with a "SIGPIPE: Pipe closed" message and I am not able to catch and/or print any error. This is an example of the Python development server output:

System check identified no issues (0 silenced).
May 09, 2020 - 15:34:14
Django version 1.10.8, using settings 'D3Tools.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
[09/May/2020 15:34:16] "GET /accounts/logout/ HTTP/1.1" 200 4699
[09/May/2020 15:34:17] "GET /static/accounts/js/main.js HTTP/1.1" 304 0
[09/May/2020 15:34:17] "GET /static/accounts/css/main.css HTTP/1.1" 304 0
[09/May/2020 15:34:17] "GET /static/accounts/water/css/style.css HTTP/1.1" 304 0
[09/May/2020 15:34:17] "GET /static/accounts/images/d3_logo_horizontal_white.png HTTP/1.1" 200 4964
[Server|login]: User truiz login successfull.
[09/May/2020 15:34:25] "POST /accounts/login/ HTTP/1.1" 302 0
[09/May/2020 15:34:25] "GET /accounts/ HTTP/1.1" 200 4960
SIGPIPE: Pipe closed.

Then I tried to compile the nim code with the stackTrace flag:

nim c --threads:on --app:lib --profiler:on --stackTrace:on --out:transformations.so transformations

In these case I get the following output from the server before crashing (instead of the SIGPIPE: Pipe closed):

./GUI/scripts/run_local.sh: line 11: 23939 Segmentation fault (core dumped) python manage.py runserver

Where run_local.sh is just a bash script that compiles webpack and starts the Django development server with "python manage.py runserver". So again, not much information.

If you can tell me which specific logs / stack traces you'd like to see and how to get them I'm happy to provide them.

@tomaso909 trying to minimize your Nim code to be able to share it would really help a lot, or also trying to replicate it without Django (maybe by using the .so file from another Nim program)

The nim code is very small already. Here I attach the transformations.nim file. Just by calling the vectorEulerRotation function from Python I already get the SIG_PIPE errors. I've tested the function in Python without using Django and it works perfectly.

import nimpy
import math


type
    Vector3 = array[3, float]

proc vectorEulerRotation(vector: Vector3, yawPitchRoll: Vector3, reverse: bool): Vector3 {.exportpy.} =

    let yaw: float = degToRad(yawPitchRoll[0])
    let pitch: float = degToRad(yawPitchRoll[1])
    let roll: float = degToRad(yawPitchRoll[2])

    let cosYaw: float = cos(yaw)
    let sinYaw: float = sin(yaw)
    let cosPitch: float = cos(pitch)
    let sinPitch: float = sin(pitch)
    let cosRoll: float = cos(roll)
    let sinRoll: float = sin(roll)

    let matrix_00: float = cosYaw*cosPitch
    let matrix_01: float = cosYaw*sinPitch*sinRoll-sinYaw*cosRoll
    let matrix_02: float = cosYaw*sinPitch*cosRoll+sinYaw*sinRoll
    let matrix_10: float = sinYaw*cosPitch
    let matrix_11: float = sinYaw*sinPitch*sinRoll+cosYaw*cosRoll
    let matrix_12: float = sinYaw*sinPitch*cosRoll-cosYaw*sinRoll
    let matrix_20: float = -sinPitch
    let matrix_21: float = cosPitch*sinRoll
    let matrix_22: float = cosPitch*cosRoll

    if reverse:
        let rotated: Vector3 = [
            matrix_00*vector[0]+matrix_10*vector[1]+matrix_20*vector[2],
            matrix_01*vector[0]+matrix_11*vector[1]+matrix_21*vector[2],
            matrix_02*vector[0]+matrix_12*vector[1]+matrix_22*vector[2]
        ]
        return rotated
    else:
        let rotated: Vector3 = [
            matrix_00*vector[0]+matrix_01*vector[1]+matrix_02*vector[2],
            matrix_10*vector[0]+matrix_11*vector[1]+matrix_12*vector[2],
            matrix_20*vector[0]+matrix_21*vector[1]+matrix_22*vector[2]
        ]
        return rotated

This function simply transforms a vector. It does not do anything network related.

@tomaso909 I can replicate it with a simple Django app, will try to understand why that happens, thanks for the info :)

@tomaso909 I think it might be not related to Nim at all - maybe it's the Django's fault really? I can reproduce it when using wrk to stress-test Django and then exit wrk with Ctrl+C, maybe it's Django who's sending those to the script or something?

@Yardanico I see your point. But if I do not import the Nim function in Django the server works without problems. So there is an interaction there that causes the error.

I posted the problem here because it looks the same you have in pure Nim servers and I was not able to find information for Django on this exact error.

@Yardanico @Araq If my case is interesting to you I can prepare a small reproducible example with a very minimal Django server. On the other hand, if you think this is more of a Django issue (or Nimpy perhaps) I can ask elsewhere.
Let me know. I'm loving Nim and I'm happy to help.

This is not a Django issue. The cause of this bug is described in https://github.com/nim-lang/Nim/issues/9867#issuecomment-473902394

With a bit effort you can simulate it by raising error on a call to socket.close(). https://github.com/nim-lang/Nim/blob/devel/lib/pure/asyncnet.nim#L721 - replace this line with raising the exception unconditionally and then call socket.close() twice. That is the essence of the problem. The second call results in corrupted memory, because SslShutdown and SSLFree are called second time on the pointers that are already freed. There is socket.closed = true, but it is not getting checked.
The actual user code may have this problem even without calling close() twice - when you use AsyncHttpClient, it closes the socket internally, and you also need to close() the AsyncHttpClient itself. So even if you have single close() in your code, and the code itself is correct, there is a corruption. And anyway, worst case should be you get an exception when you misuse an API, not a corruption.

The AsyncSocket and SSL APIs in Nim are in general very fragile and easy to misuse, creating leaks and other bugs. For example, you may call wrapSocket() on a socket that was already wrapped and previous pointers will be leaked. Thorough review of the API is necessary, with exceptional cases being documented and handled where necessary. At the moment I wouldn't suggest using these APIs in production applications - they are ridden with bugs even if you use them correctly, and you may not get a timely error when you misuse them.

The AsyncSocket and SSL APIs in Nim are in general very fragile and easy to misuse, creating leaks and other bugs.

This IMO qualifies this as a showstopper. I'll let Araq change it if he disagress but otherwise we could use some support to fix it.

I am using ssl + asynchttpclient and getting a segfault from a long-running application, last one was after 2 days.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f10582e61bd in ssl3_read_internal () from /lib64/libssl.so.1.1                                                                                                                                                                       
(gdb) bt                                                                                                                                                                                                         
#0  0x00007f10582e61bd in ssl3_read_internal () from /lib64/libssl.so.1.1                                                                                                                                        
#1  0x00007f10582f133e in ssl_read_internal () from /lib64/libssl.so.1.1
#2  0x00007f10582f1447 in SSL_read () from /lib64/libssl.so.1.1                                                                                                                                                  
#3  0x00000000004278af in ?? ()                                                                                                                                                                                  
#4  0x0000000000427b63 in ?? ()                                                                                                                                                                                  
#5  0x0000000000427f08 in ?? ()                                                                                                                                                                                  
#6  0x0000000000428057 in ?? ()                                                                                                                                                                                  
#7  0x000000000042811a in ?? ()                                                                                                                                                                                  
#8  0x000000000042847c in ?? ()
#9  0x000000000042fd19 in ?? ()                                                                                                                                                                                  
#10 0x000000000042fde1 in ?? ()                                                                                                                                                                                  
#11 0x000000000043013b in ?? ()                                                                                                                                                                                  
#12 0x0000000000431bfd in ?? ()                                                                                                                                                                                  
#13 0x0000000000431f88 in ?? ()                                                                                                                                                                                  
#14 0x00000000004210d9 in ?? ()                                                                                                                                                                                  
#15 0x000000000042168e in ?? ()                                                                                                                                                                                  
#16 0x0000000000458b6a in ?? ()                                                                                                                                                                                  
#17 0x00007f1058e9fe9c in __run_exit_handlers () from /lib64/libc.so.6                                                                                                                                           
#18 0x00007f1058e9ffd0 in exit () from /lib64/libc.so.6                                                                                                                                                          
#19 0x0000000000406f9a in ?? ()                                                                                                                                                                                  
#20 <signal handler called>                                                                                                                                                                                      
#21 0x00007f10582e61bd in ssl3_read_internal () from /lib64/libssl.so.1.1                                                                                                                                        
#22 0x00007f10582f133e in ssl_read_internal () from /lib64/libssl.so.1.1                                                                                                                                         
#23 0x00007f10582f1447 in SSL_read () from /lib64/libssl.so.1.1

Fair to assume that is a symptom of the underlying issue here?

There's some mentions of possible work on this, I used 1.3.5 devel (don't have the git hash, but can re-compile with another release and debugging/stacktracing on).

Please recompile with stack trace on. My guess would be that SSL_shutdown() was called on a closed socket, which will generates SIGPIPE if it attempts to negotiate a shutdown with the client and the socket is closed (which will be true most of the time). I can do a fix for this case.

Also I'd recommend that everyone running their Nim program as a SSL server should install SIG_IGN as the SIGPIPE handler (actually we might want to do this in the stdlib). SIGPIPE is a pretty old signal and is largely obsoleted by EPIPE that's returned from any failed send()/write() attempt on a closed socket/pipe (we don't get to handle this one because SIGPIPE got in the way).

Please recompile with stack trace on. My guess would be that SSL_shutdown() was called on a closed socket, which will generates SIGPIPE if it attempts to negotiate a shutdown with the client and the socket is closed (which will be true most of the time). I can do a fix for this case.

Will do

Please recompile with stack trace on. My guess would be that SSL_shutdown() was called on a closed socket, which will generates SIGPIPE if it attempts to negotiate a shutdown with the client and the socket is closed (which will be true most of the time). I can do a fix for this case.

Will do

don't mean to necro, but the reason I haven't posted anything is, since that message, is because I had pulled latest devel, and recompiled with stack tracing on etc... and it hasn't crashed since. I will probably recompile with the currently latest devel, and notify back if it runs into issues going forward.

I will probably recompile with the currently latest devel, and notify back if it runs into issues going forward.

Please do :) And thanks for letting me know that previous patches does manage to prevent crashes in certain configurations. With the latest fix, at least for the kind of errors in the first post, is fully resolved. However only in real-world testing would we know if this is the case 100% of the time.

I will probably recompile with the currently latest devel, and notify back if it runs into issues going forward.

Please do :) And thanks for letting me know that previous patches does manage to prevent crashes in certain configurations. With the latest fix, at least for the kind of errors in the first post, is fully resolved. However only in real-world testing would we know if this is the case 100% of the time.

Seems to be segfaulting on something else now, after using latest nightly compiler

Aug 04 15:26:36 *hostname* app[356277]: DBG 2020-08-04 15:26:36-04:00 Blocking main thread for future resolution tid=356277                                                                                 
Aug 04 15:26:44 *hostname* app[356277]: Traceback (most recent call last)
Aug 04 15:26:44 *hostname* app[356277]: my_app.nim my_app
Aug 04 15:26:44 *hostname* app[356277]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 15:26:44 *hostname* app[356277]: cligen.nim               dispatchapp
Aug 04 15:26:44 *hostname* app[356277]: my_app.nim app
Aug 04 15:26:44 *hostname* app[356277]: log_output.nim           initLogRecord
Aug 04 15:26:44 *hostname* app[356277]: terminal.nim             ansiForegroundColorCode
Aug 04 15:26:44 *hostname* app[356277]: terminal.nim             ansiStyleCode
Aug 04 15:26:44 *hostname* app[356277]: gc.nim                   newObjNoInit
Aug 04 15:26:44 *hostname* app[356277]: alloc.nim                rawAlloc
Aug 04 15:26:44 *hostname* app[356277]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Aug 04 15:26:51 *hostname* systemd[1]: app.service: Main process exited, code=dumped, status=11/SEGV

gdb coredump backtrace doesn't yield anything useful, I may try another nightly, as this appears a seperate problem, but still adding here for posterity and in case I'm mistaken.

@D-Nice can you try with --gc:arc ?

@Yardanico the project imports chronicles package, which has a dependency on chronos, which apparently won't compile with arc/orc due to some unsupported cast from chronos.

EDIT: Forgot to mention, it just crashed again after 3ish hours.

I am attempting other GCs, and have previously run this app with boehm, without issues, but on latest devel, boehm now crashes very shortly after initialization.

First crash with boehm, compressed bin with upx

Aug 04 19:52:49 *hostname* app[371851]: ERR 2020-08-04 19:52:49-04:00 Main event:                                tid=371851                                                                                    
Aug 04 19:52:49 *hostname* app[371851]: error:140E0197:SSL routines:SSL_shutdown:shutdown while in init                                                                                                        
Aug 04 19:52:49 *hostname* app[371851]: Traceback (most recent call last)
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim my_app
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               dispatchapp
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim app
Aug 04 19:52:49 *hostname* app[371851]: DBG 2020-08-04 19:52:49-04:00 Making Request                             tid=371851
Aug 04 19:52:49 *hostname* app[371851]: Traceback (most recent call last)
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim my_app
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               dispatchapp
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim app
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        waitFor
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        poll
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        runOnce
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        processPendingCallbacks
Aug 04 19:52:49 *hostname* app[371851]: asyncmacro.nim           sendNimAsyncContinue
Aug 04 19:52:49 *hostname* app[371851]: asyncmacro.nim           sendIter
Aug 04 19:52:49 *hostname* app[371851]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Aug 04 19:52:50 *hostname* systemd[1]: app.service: Main process exited, code=dumped, status=11/SEGV

Second crash with boehm, uncompressed bin

Aug 04 19:58:59 *hostname* app[371964]: DBG 2020-08-04 19:58:59-04:00 Blocking main thread for future resolution tid=371964                                                                                    
Aug 04 19:59:11 *hostname* app[371964]: Traceback (most recent call last)
Aug 04 19:59:11 *hostname* app[371964]: my_app.nim my_app
Aug 04 19:59:11 *hostname* app[371964]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 19:59:11 *hostname* app[371964]: cligen.nim               dispatchapp
Aug 04 19:59:11 *hostname* app[371964]: my_app.nim app
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        waitFor
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        poll
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        runOnce
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        processPendingCallbacks
Aug 04 19:59:11 *hostname* app[371964]: asyncmacro.nim           sendNimAsyncContinue
Aug 04 19:59:11 *hostname* app[371964]: asyncmacro.nim           sendIter
Aug 04 19:59:11 *hostname* app[371964]: openssl.nim              deallocWrapper
Aug 04 19:59:11 *hostname* app[371964]: boehm.nim                deallocSharedImpl
Aug 04 19:59:11 *hostname* app[371964]: boehm.nim                deallocImpl
Aug 04 19:59:11 *hostname* app[371964]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Aug 04 19:59:13 *hostname* systemd[1]: app.service: Main process exited, code=dumped, status=11/SEGV

gdb backtrace of second crash

(gdb) bt
#0  0x00007f589449102a in GC_free () from /lib64/libgc.so.1
#1  0x00000000004060a6 in ?? ()
#2  0x000000000040613c in ?? ()
#3  0x000000000041756f in ?? ()
#4  0x00007f5892cdca74 in ERR_clear_error () from /lib64/libcrypto.so.1.1
#5  0x00007f58927258e0 in state_machine.part () from /lib64/libssl.so.1.1
#6  0x00007f58926fc5d0 in ssl3_write_bytes () from /lib64/libssl.so.1.1
#7  0x00007f589270f5d6 in ssl_write_internal () from /lib64/libssl.so.1.1
#8  0x00007f589270f707 in SSL_write () from /lib64/libssl.so.1.1
#9  0x00000000004353ff in ?? ()
#10 0x000000000043560a in ?? ()
#11 0x000000000042f319 in ?? ()
#12 0x000000000042fbf4 in ?? ()
#13 0x00000000004300b7 in ?? ()
#14 0x000000000047f4c3 in ?? ()
#15 0x00007f5894a37e9c in __run_exit_handlers () from /lib64/libc.so.6
#16 0x00007f5894a37fd0 in exit () from /lib64/libc.so.6
#17 0x0000000000405ea6 in ?? ()
#18 <signal handler called>
#19 0x00007f589449102a in GC_free () from /lib64/libgc.so.1
#20 0x00000000004060a6 in ?? ()
#21 0x000000000040613c in ?? ()
#22 0x000000000041756f in ?? ()
#23 0x00007f5892cdca74 in ERR_clear_error () from /lib64/libcrypto.so.1.1
#24 0x00007f589272a22d in tls_process_server_certificate () from /lib64/libssl.so.1.1
#25 0x00007f589272c965 in ossl_statem_client_process_message () from /lib64/libssl.so.1.1
#26 0x00007f5892726043 in state_machine.part () from /lib64/libssl.so.1.1
#27 0x00007f58926fc5d0 in ssl3_write_bytes () from /lib64/libssl.so.1.1
#28 0x00007f589270f5d6 in ssl_write_internal () from /lib64/libssl.so.1.1
#29 0x00007f589270f707 in SSL_write () from /lib64/libssl.so.1.1
#30 0x00000000004353ff in ?? ()
#31 0x000000000043560a in ?? ()
#32 0x000000000042f319 in ?? ()
#33 0x000000000042fbf4 in ?? ()
#34 0x00000000004300b7 in ?? ()
#35 0x0000000000432753 in ?? ()
#36 0x0000000000489be2 in ?? ()
#37 0x000000000048d75e in ?? ()
#38 0x000000000048da7b in ?? ()
#39 0x00000000004905bc in ?? ()
#40 0x0000000000490631 in ?? ()
#41 0x00007f5894a216a3 in __libc_start_main () from /lib64/libc.so.6
#42 0x000000000040248e in ?? ()

The latest nightly I'm referring to is:

Nim Compiler Version 1.3.5 [Linux: amd64]
Compiled at 2020-08-03
Copyright (c) 2006-2020 by Andreas Rumpf

git hash: d6161e7e9ea1f40ad734c6e811e8169fe24b5e9f
active boot switches: -d:release

As a note, similar crash still in boehm, with 1.2.6. I could swear I ran it fine with some earlier nim 1.2.x, I'll investigate more there.

markAndSweep won't compile at all with nightly for my app, getting
Error: system module needs: unsureAsgnRef
compiles fine with stable 1.2.4 & 1.2.6

Running it for now with 1.2.6 and M&S gc to see how it fares over time.

This is interesting. Can you enable debug information (with --debugger:native) and use the default GC? It should give out coredumps with usable stacktrace then.

Also I'm seeing

error:140E0197:SSL routines:SSL_shutdown:shutdown while in init

in one of your logs. This is really odd as this bug should have been completely fixed in devel. Can you confirm your version with nim -v?

@alaviss

For that SSL error you've mentioned, I can confirm it was d6161e7e9ea1f40ad734c6e811e8169fe24b5e9f for the git hash of compiler used to compile that variant, albeit with boehm gc.

I'm looking to create a minimal reproducible example, and see if I can get same issues going there.

Compiled with recent devel: 2a8bf470af995f4e7d5bb8a0f49bb2226f95c311

Compile flags for the code: --gc:refc --stackTrace:on --debugger:native --skipProjCfg:on -d:chronicles_log_level:TRACE -d:danger --opt:size -d:ssl

backtrace with native debugger flag:

#0  0x00007f9e0cd1a3a3 in OPENSSL_sk_pop_free () from /lib64/libcrypto.so.1.1
#1  0x00007f9e0cd34e9d in X509_VERIFY_PARAM_free () from /lib64/libcrypto.so.1.1
#2  0x00007f9e0c49ed15 in SSL_free () from /lib64/libssl.so.1.1
#3  0x000055b9da4f1662 in close__g8nGs1aFQjSyKaCvnSGH1w (socket=0x7f9e0b9ee030) at /nim/lib/pure/asyncnet.nim:699                                                                                                
#4  0x000055b9da4fb8d3 in close__h9bWTiUUZLlQgLXHDtjyONA (client=0x7f9e0b819f28) at /nim/lib/pure/httpclient.nim:619                                                                                             
#5  0x000055b9da52feb7 in reqBtcAddressDetailsIter__ioevzQT4OCMuTzRWSwCM6g_2 (ClE_0=0x7f9e0bbb2518) at /tmp/project/src/crypto_address_watcher.nim:22                                                            
#6  0x000055b9da5304ac in reqBtcAddressDetailsNimAsyncContinue__Z9cQmJ9aPol6ErCmKfu4D3gw_2 (ClE_0=0x7f9e0b9d07c8) at /nim/lib/pure/asyncmacro.nim:34                                                             
#7  0x000055b9da4ec915 in processPendingCallbacks__xdICRvJ1ns9byYHPjM07uJQ (p=p@entry=0x7f9e0d7b1ae8, didSomeWork=didSomeWork@entry=0x7ffc0706f71f "\001\b\374\006\a\374\177")                                   
    at /nim/lib/pure/asyncdispatch.nim:210
#8  0x000055b9da4ecb8c in runOnce__nEnn7ilZ6aSkcXTtLEKnJA (timeout=timeout@entry=1000) at /nim/lib/pure/asyncdispatch.nim:1340                                                                                   
#9  0x000055b9da4ecf99 in poll__wcvh7vf5dGU1VYto9aiAnpA (timeout=1000) at /nim/lib/pure/asyncdispatch.nim:1576                                                                                                   
#10 0x000055b9da52ee6f in quitPoll__R2CWtKf4BdVZhGAdjU9b3cg_2 () at /tmp/project/src/crypto_address_watcher.nim:242                                                                                              
#11 0x00007f9e0d049e9c in __run_exit_handlers () from /lib64/libc.so.6
#12 0x00007f9e0d049fd0 in exit () from /lib64/libc.so.6
#13 0x000055b9da4c482b in signalHandler (sign=<optimized out>) at /nim/lib/system/excpt.nim:624
#14 <signal handler called>
#15 0x00007f9e0cd4d479 in OPENSSL_cleanse () from /lib64/libcrypto.so.1.1
#16 0x00007f9e0ccb9195 in CRYPTO_clear_free () from /lib64/libcrypto.so.1.1
#17 0x00007f9e0cbff221 in BUF_MEM_free () from /lib64/libcrypto.so.1.1
#18 0x00007f9e0cbe336a in mem_free () from /lib64/libcrypto.so.1.1
#19 0x00007f9e0cbdc49a in BIO_free () from /lib64/libcrypto.so.1.1
#20 0x00007f9e0cbdcfdc in BIO_free_all () from /lib64/libcrypto.so.1.1
#21 0x00007f9e0c49ed43 in SSL_free () from /lib64/libssl.so.1.1
#22 0x000055b9da4f1662 in close__g8nGs1aFQjSyKaCvnSGH1w (socket=0x7f9e0bb3e030) at /nim/lib/pure/asyncnet.nim:699                                                                                                
#23 0x000055b9da4fb8d3 in close__h9bWTiUUZLlQgLXHDtjyONA (client=0x7f9e0bbbc2e8) at /nim/lib/pure/httpclient.nim:619                                                                                             
#24 0x000055b9da52f5d3 in reqEthAddressDetailsIter__ioevzQT4OCMuTzRWSwCM6g (ClE_0=0x7f9e0bbb2258) at /tmp/project/src/crypto_address_watcher.nim:22                                                              
#25 0x000055b9da530262 in reqEthAddressDetailsNimAsyncContinue__Z9cQmJ9aPol6ErCmKfu4D3gw (ClE_0=0x7f9e0ba60f98) at /nim/lib/pure/asyncmacro.nim:34                                                               
#26 0x000055b9da4ec915 in processPendingCallbacks__xdICRvJ1ns9byYHPjM07uJQ (p=p@entry=0x7f9e0d7b1ae8, didSomeWork=didSomeWork@entry=0x7ffc07070f5f "\001H\024\a\a\374\177")                                      
    at /nim/lib/pure/asyncdispatch.nim:210
#27 0x000055b9da4ecb8c in runOnce__nEnn7ilZ6aSkcXTtLEKnJA (timeout=timeout@entry=500) at /nim/lib/pure/asyncdispatch.nim:1340                                                                                    
#28 0x000055b9da4ecf99 in poll__wcvh7vf5dGU1VYto9aiAnpA (timeout=timeout@entry=500) at /nim/lib/pure/asyncdispatch.nim:1576                                                                                      
#29 0x000055b9da4ed03f in waitFor__zz9aVsJN0aTuFUXvzKrFo9cA (fut=0x7f9e0bb01288) at /nim/lib/pure/asyncdispatch.nim:1886                                                                                         
#30 0x000055b9da538758 in caw__mMsoXFdgrPsk9aq0MM3yjQw (timeoutInSeconds=<optimized out>, ethAddressList=0x7f9e0d785d48, afterEthBlock=10622183, btcAddressList=0x7f9e0d785cc8, afterBtcBlock=620000,            
    alertSlack=1 '\001', alertAlexa=1 '\001', alertNotify=0 '\000', healthCheckUrl=0x7f9e0d786108, customCmd=0x0, retry=1 '\001', requestDelay=500) at /tmp/project/src/crypto_address_watcher.nim:189           
#31 0x000055b9da53bd09 in dispatchcaw__CzSmPHCYzo7ii1SVC5sbZQ (cmdline=0x7f9e0d7b7120, usage=<optimized out>, prefix=<optimized out>, parseOnly=<optimized out>, skipHelp=<optimized out>,                       
    noHdr=<optimized out>, ClE_0=0x7f9e0d7b7048) at /root/.nimble/pkgs/cligen-1.1.0/cligen.nim:703
#32 0x000055b9da53be4f in cligenDoNotCollideWithGlobalVar__pv2uo9c25a0IVl6NgMfudlA (cfX60gensym41140218_=0x55b9da596100 <clCfg__POT08z3OfOElyZB6YNbrng>) at /root/.nimble/pkgs/cligen-1.1.0/cligen.nim:741       
#33 0x000055b9da53c04a in NimMainModule () at /tmp/project/src/crypto_address_watcher.nim:251
#34 0x000055b9da53ee34 in NimMain () at /root/.nimble/pkgs/serialization-0.1.0/serialization/errors.nim:96                                                                                                       
#35 0x000055b9da4bc89e in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /root/.nimble/pkgs/serialization-0.1.0/serialization/errors.nim:103                                          

logs

Aug 08 21:54:14 *hostname* caw[423968]: TRC 2020-08-08 21:54:14-04:00 new async http client up                   tid=423968 client="ref 0x7f9e0bbbc208 --> [socket = nil,\nconnected = false,\ncurrentU"    
Aug 08 21:54:14 *hostname* caw[423968]: TRC 2020-08-08 21:54:14-04:00 new async http client up                   tid=423968 client="ref 0x7f9e0b819f28 --> [socket = nil,\nconnected = false,\ncurrentU"    
Aug 08 21:54:15 *hostname* caw[423968]: TRC 2020-08-08 21:54:15-04:00 closing async http client                  tid=423968 client="ref 0x7f9e0bbbc2e8 --> [socket = ref 0x7f9e0bb3e030 --> [fd = 3,\n"     
Aug 08 21:54:15 *hostname* caw[423968]: Traceback (most recent call last)
Aug 08 21:54:15 *hostname* caw[423968]: crypto_address_watcher.nim crypto_address_watcher
Aug 08 21:54:15 *hostname* caw[423968]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 08 21:54:15 *hostname* caw[423968]: cligen.nim               dispatchcaw
Aug 08 21:54:15 *hostname* caw[423968]: crypto_address_watcher.nim caw
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        waitFor
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        poll
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        runOnce
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        processPendingCallbacks
Aug 08 21:54:15 *hostname* caw[423968]: asyncmacro.nim           reqEthAddressDetailsNimAsyncContinue                                                                                                       
Aug 08 21:54:15 *hostname* caw[423968]: asyncmacro.nim           reqEthAddressDetailsIter
Aug 08 21:54:15 *hostname* caw[423968]: httpclient.nim           close
Aug 08 21:54:15 *hostname* caw[423968]: asyncnet.nim             close
Aug 08 21:54:15 *hostname* caw[423968]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)                                                                                                        
Aug 08 21:54:18 *hostname* caw[423968]: TRC 2020-08-08 21:54:18-04:00 closing async http client                  tid=423968 client="ref 0x7f9e0bbbc128 --> [socket = ref 0x7f9e0d71d030 --> [fd = 11,"      
Aug 08 21:54:18 *hostname* caw[423968]: TRC 2020-08-08 21:54:18-04:00 closed async http client                   tid=423968 client="ref 0x7f9e0bbbc128 --> [socket = ref 0x7f9e0d71d030 --> [fd = 11,"      
Aug 08 21:54:18 *hostname* caw[423968]: TRC 2020-08-08 21:54:18-04:00 closing async http client                  tid=423968 client="ref 0x7f9e0b819f28 --> [socket = ref 0x7f9e0b9ee030 --> [fd = 13,"      
Aug 08 21:54:19 *hostname* systemd[1]: caw.service: Main process exited, code=dumped, status=11/SEGV                                                                                                        
Aug 08 21:54:19 *hostname* systemd[1]: caw.service: Failed with result 'core-dump'.

This seems to be caused by a double close. I've PR-ed a (hack?) fix in #15174. A proper fix will be done once I got some insight into some weird design within asyncnet.

Was this page helpful?
0 / 5 - 0 ratings