Hi there!
After updating my Kemal web-app on Heroku with Crystal 0.31.0 I started to get a lot of error messages Index out of bounds (IndexError).
Minimal reproducible code:
# main.cr
require "http/server"
server = HTTP::Server.new([
HTTP::LogHandler.new
]) do |context|
context.response.content_type = "text/plain"
context.response.print "Hello world!\n" * 1000 # << HERE
end
puts "Listening on http://0.0.0.0:3000"
server.listen(3000)
It is reproducible only with --release flag and large body in response.
crystal build --release main.cr
Each request causes an error
GET / - Unhandled exception:
Index out of bounds (IndexError)
from /snap/crystal/118/share/crystal/src/crystal/thread_local_value.cr:26:7 in 'unbuffered_write'
from /snap/crystal/118/share/crystal/src/http/server/response.cr:217:11 in 'unbuffered_write'
from /snap/crystal/118/share/crystal/src/http/server/handler.cr:0:3 in 'call'
from /snap/crystal/118/share/crystal/src/http/server/request_processor.cr:48:11 in '->'
from /snap/crystal/118/share/crystal/src/fiber.cr:255:3 in 'run'
from ???
P.S.
crystal -v
Crystal 0.31.0 [8d4715a59] (2019-09-24)
LLVM: 8.0.0
Default target: x86_64-unknown-linux-gnu
BTW, can't reproduce on macOS.
crystal -v
Crystal 0.31.0 (2019-09-23)
LLVM: 8.0.1
Default target: x86_64-apple-macosx
Notice it's LLVM: 8.0.1 on MacOS but LLVM: 8.0.0 on Linux.
llvm 8.0.0 or 8.0.1 makes no difference. But the release mode does as @mamantoha said.
The error is also present in llvm 4. I checked since that was updated recently on the linux builds.
I don't get it still. My guess is that there is something odd with the state of the Slice somewhere.
dprintf along the way the problems is solved 卢卢. Patchraise in check_size the problem is also solved 卢卢. Patch@[NoInline] in that method didn't solved the issue.Any other guesses? Otherwise, maybe the second patch is a workaround for now. It is only affecting the path that will raise an exception...
@bcardiff Is this an error on Crystal's side? It only happens in linux? Is it related to some LLVM version?
I鈥檓 not sure.
I didn鈥檛 bisect on older crystal versions.
I checked llvm-ir (Everything seems reasonable there), but I didn鈥檛 dare to check asm. It already felt as an escape room.
If I'm seeing the stack trace correctly, the error might be triggering here:
I would try adding, just before that line:
LibC.printf("slice.size: %d written: %d\n", slice.size, bytes_written.to_i!)
and seeing if there's anything strange there.
(I can't try it because on mac it seems to work fine)
From there it goes to Slice#+ which call check_size.
I could reproduce it on docker.
From there it goes to Slice#+ which call check_size.
And that's where the exception happens, right?
My idea is to put debug info before that call to see if there's something wrong. Maybe bytes_written is something that's not -1 but also something that's not good, I don't know.
Would you mind checking to see if:
# https://github.com/crystal-lang/crystal/blob/master/src/crystal/thread_local_value.cr
struct Crystal::ThreadLocalValue(T)
def get(&block : -> T)
th = Thread.current
if !@values[th]?
@values[th] = yield
else
@values[th]
end
end
end
require "http/server"
server = HTTP::Server.new([
HTTP::LogHandler.new
]) do |context|
context.response.content_type = "text/plain"
context.response.print "Hello world!\n" * 1000 # << HERE
end
puts "Listening on http://0.0.0.0:3000"
server.listen(3000)
"fixes" the issue?
@bcardiff Could you give me step by step instructions to be able to reproduce the bug? I'm not super familiar with Docker not with which Dockerfile I should be using. I'd like to try to debug the problem and see if I can find something.
@omarroth I can confirm this "fixes" an issue
@mamantoha Could you also try this? It's slightly different than the other comment (which is not thread-safe).
# https://github.com/crystal-lang/crystal/blob/master/src/crystal/thread_local_value.cr
struct Crystal::ThreadLocalValue(T)
def get(&block : -> T)
@mutex.sync do
th = Thread.current
@values.fetch(th) do
@values[th] = yield
end
end
end
end
require "http/server"
server = HTTP::Server.new([
HTTP::LogHandler.new
]) do |context|
context.response.content_type = "text/plain"
context.response.print "Hello world!\n" * 1000 # << HERE
end
puts "Listening on http://0.0.0.0:3000"
server.listen(3000)
@asterite that does not fix the issue. I confirm that would seem to be gone by @omarroth patch but also with the following
struct Crystal::ThreadLocalValue(T)
def get(&block : -> T)
th = Thread.current
@mutex.sync do
if !@values[th]?
@values[th] = yield
else
@values[th]
end
end
end
end
my guess is that the double lookup in the hash is making things slightly slower and hiding the underlying issue.
@asterite to reproduce it from mac
$ docker run -it --rm --privileged --pid=host -v $(pwd):/src -w /src crystallang/crystal:0.30.1 bash
> apt-get update && apt-get install -y software-properties-common lsb-release wget curl
> curl https://apt.llvm.org/llvm.sh -o llvm.ignoreme.sh
> chmod +x llvm.ignoreme.sh
> ./llvm.ignoreme.sh 8
> make clean clean_cache crystal
> ./bin/crystal run --release http.ignoreme.cr
$ docker exec -it $(docker ps -q) bash
> curl http://localhost:3000
(sorry I have a global *.ignoreme.* rule)
I can reproduce it directly from 0.31.0 image also.
Using git bisect, 19d2a44b40039350567c15def4b2825c02c9bb58 is the first commit where I'm able to produce the issue.
However, it still occurs when reverting the above commit on master, so I don't expect that it actually produced a regression. As @bcardiff suggested the actual issue is likely hidden by various tweaks to performance.
Most helpful comment
llvm 8.0.0 or 8.0.1 makes no difference. But the release mode does as @mamantoha said.
The error is also present in llvm 4. I checked since that was updated recently on the linux builds.