Crystal hangs/freezes when fetching HTTPS URLs using fibers

Created on 22 Jun 2016 · 22 comments · Source: crystal-lang/crystal

Crystal freezes up completely after fetching HTTPS URLs using fibers for a while.

I've reproduced this both with the default OS X OpenSSL (0.9.8zh, 14 Jan 2016) and with the latest OpenSSL installed via Homebrew (1.0.2h, 3 May 2016).

Details:

Crystal compiled from source:
$ crystal version
Crystal 0.18.4+57 [bab00a9] (2016-06-22)

OpenSSL 1.0.2:
$ openssl version
OpenSSL 1.0.2h 3 May 2016
$ pkg-config --libs libssl
-L/usr/local/Cellar/openssl/1.0.2h_1/lib -lssl

OpenSSL 0.9.8:
$ openssl version
OpenSSL 0.9.8zh 14 Jan 2016
$ pkg-config --libs libssl
-lssl -lcrypto -lz

How to reproduce:

Update:

require "http/client"
HTTP::Client.get("https://www.youtube.com/watch")

Previously:
$ git clone git://github.com/Agiley/fiber-freeze.git && cd fiber-freeze
$ crystal examples/simple_https_with_fibers.cr
Source: examples/simple_https_with_fibers.cr
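For context, here's a minimal sketch of the fiber-based fetching pattern that script uses (the short URL list is a stand-in; the real script loads a much larger set from the repo):

require "http/client"

# Stand-in URL list; the actual script in the repo loads many more.
urls = [
  "https://www.reddit.com/wiki/reddiquette/",
  "https://www.youtube.com/watch",
]

channel = Channel(Nil).new

urls.each do |url|
  spawn do
    puts "Fetching #{url}..."
    HTTP::Client.get(url)
    channel.send(nil)
  end
end

# Block until every fiber has reported back.
urls.size.times { channel.receive }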

For me it usually takes 1-2 minutes before it freezes up. With OpenSSL 1.0.2h the last output in the console is typically "Fetching https://www.reddit.com/wiki/reddiquette/...", and with OpenSSL 0.9.8zh it usually gets stuck 1-3 URLs after that specific URL is printed.

I've let the script run for several hours, including overnight, thinking it might eventually get unstuck - unfortunately it didn't.

Looking at the script process's memory consumption, the amount of allocated memory stays constant after it freezes, as if the GC isn't able to reclaim memory?

There doesn't seem to be an issue with regular non-SSL URLs; I've run a script that fetches ~10k URLs without it freezing up ($ crystal examples/http_with_fibers.cr).

I've also reproduced this issue while running my crawler with Sidekiq.cr: eventually it stops processing new jobs and just gets stuck, requiring me to force-kill the process (CTRL + C doesn't work).

Labels: bug, stdlib

All 22 comments

Backtrace when stuck:

#0  0x0000000000472205 in *Zlib::Inflate#read<Slice(UInt8)>:Int32 ()
#1  0x000000000047b552 in *IO::Decoder#read<Zlib::Inflate>:(Int32 | Nil) ()
#2  0x0000000000473e21 in *Zlib::Inflate@IO#gets_to_end:String ()
#3  0x000000000049a0c7 in *HTTP::Client::Response#consume_body_io:Nil ()
#4  0x00000000004998f6 in *HTTP::Client::Response::from_io<(OpenSSL::SSL::Socket+ | TCPSocket+), Bool, Bool>:HTTP::Client::Response ()
#5  0x000000000048dd39 in *HTTP::Client#exec_internal<HTTP::Request>:HTTP::Client::Response ()
#6  0x000000000048db76 in *HTTP::Client#exec<HTTP::Request>:HTTP::Client::Response ()
#7  0x000000000048d887 in *HTTP::Client#exec<String, String, HTTP::Headers, Nil>:HTTP::Client::Response ()
#8  0x000000000048d84d in *HTTP::Client#get:headers<String, HTTP::Headers>:HTTP::Client::Response ()
#9  0x0000000000441e94 in ~procProc(Nil)@./examples/simple_https_with_fibers.cr:15 ()
#10 0x0000000000449704 in *Fiber#run:(Int64 | Nil) ()
#11 0x000000000043d761 in ~procProc(Fiber, (Int64 | Nil))@/home/jhass/projects/crystal/code/src/fiber.cr:27 ()
#12 0x0000000000000000 in ?? ()

Are we sure it's actually TLS related? The set of URLs becomes quite distinct at that point, so it might just be a particular response that trips things up and simply isn't in the set of HTTP URLs.

That said it's entirely possible that we do something incorrectly in OpenSSL::SSL::Socket, like not giving up control of the fiber in some situations where we should, or vice versa.

I'm not sure it's actually TLS; it's just a guess.

I disabled crawling of HTTPS URLs in my crawler and it kept on running; when I re-enabled them, it eventually got stuck again. Running $ crystal examples/http_with_fibers.cr also successfully fetches 10k+ HTTP URLs without hanging (at least for me).

Whether it gets stuck because of TLS itself or because one or more HTTPS URLs return weird responses (unrelated to TLS), I'm not sure.

Okay, it gets into a busy loop with ret == LibZ::Error::BUF_ERROR and read_bytes == 0 here: https://github.com/crystal-lang/crystal/blob/master/src/zlib/inflate.cr#L94

The following patch makes it not get stuck:

diff --git a/src/zlib/inflate.cr b/src/zlib/inflate.cr
index 44cf2f7..55b260f 100644
--- a/src/zlib/inflate.cr
+++ b/src/zlib/inflate.cr
@@ -86,7 +86,8 @@ class Zlib::Inflate
       case ret
       when LibZ::Error::NEED_DICT,
            LibZ::Error::DATA_ERROR,
-           LibZ::Error::MEM_ERROR
+           LibZ::Error::MEM_ERROR,
+           LibZ::Error::BUF_ERROR
         raise Zlib::Error.new(ret, @stream)
       when LibZ::Error::STREAM_END
         return read_bytes

However, I'm not sure it's the correct solution. @asterite last touched this and basically made the busy loop possible in this situation with 0c03e8b5a0de466e845b806d58fe391272fffc71, so maybe he has some more insight.

The offending URL for me seems to be https://www.youtube.com/watch?v=GsyjNef2ydQ, but

require "http/client"

r = HTTP::Client.get("https://www.youtube.com/watch?v=GsyjNef2ydQ")
p r

does not reproduce. And given that you see different offenders on OpenSSL 0.9.8 vs 1.0.2, some bad interaction with OpenSSL::SSL::Socket still isn't out of the question.

So we just shouldn't retry on BUF_ERROR when read_bytes == 0? Like this, rather?

diff --git a/src/zlib/inflate.cr b/src/zlib/inflate.cr
index 44cf2f7..ce78e47 100644
--- a/src/zlib/inflate.cr
+++ b/src/zlib/inflate.cr
@@ -88,6 +88,14 @@ class Zlib::Inflate
            LibZ::Error::DATA_ERROR,
            LibZ::Error::MEM_ERROR
         raise Zlib::Error.new(ret, @stream)
+      when LibZ::Error::BUF_ERROR
+        if read_bytes == 0
+          # The last iteration could perfectly fill the buffer, so this iteration
+          # could not produce any more content and we're done.
+          return slice.size
+        else
+          raise Zlib::Error.new(ret, @stream)
+        end
       when LibZ::Error::STREAM_END
         return read_bytes
       else

Mmmh, no, that segfaults.

Yes, it seems like https://www.youtube.com/watch?v=GsyjNef2ydQ is the offending URL.

Managed to create a more isolated test case:
https://gist.github.com/Agiley/b329df14ef46b913ba7f05a0b5c90355

The weird thing is that it only happens in the second batch, where HTTP::Client is instantiated directly and has timeouts set.
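For reference, the second batch sets up its client roughly like this (a sketch; the timeout values are illustrative assumptions, not the gist's actual settings):

require "http/client"

client = HTTP::Client.new(URI.parse("https://www.youtube.com/watch?v=GsyjNef2ydQ"))
client.connect_timeout = 10.seconds # assumed value
client.read_timeout = 10.seconds    # assumed value
client.get("/watch") # note: path only, the query string is dropped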

Timeouts are interesting: could the timeout be reached while zlib keeps waiting for more input data that will never come? I.e., the socket isn't closed properly in OpenSSL?

There are likely two issues here, yes. The first is that Zlib::Inflate#read shouldn't be able to run into a busy loop. The second is whatever creates the conditions for that edge case in Zlib::Inflate#read to occur in the first place.

Simplified it even more: https://gist.github.com/Agiley/8562d2a5218f8e6553bb5fbc20e61bb2

Seems like timeouts didn't have any effect.

Calling HTTP::Client.get(url) works, but:

uri = URI.parse(url)
path = uri.path.try &.empty? ? "/" : uri.path.as(String)
HTTP::Client.new(uri).get(path)

does not.

Also, forcing HTTP::Client to close the underlying socket doesn't change anything either; the script still freezes up:

client = HTTP::Client.new(uri)
client.get(path)
client.close

It's actually https://www.youtube.com/watch that is the offending URL.

I had forgotten that uri.path doesn't include the query string component, so the second example effectively requests https://www.youtube.com/watch rather than https://www.youtube.com/watch?v=GsyjNef2ydQ like the first example does - hence the first batch works but the second one doesn't.
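For reference, this is the URI#path behavior in question:

require "uri"

uri = URI.parse("https://www.youtube.com/watch?v=GsyjNef2ydQ")
uri.path  # => "/watch"
uri.query # => "v=GsyjNef2ydQ" (lives in #query, not #path)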

Now both examples fail:

Example 1:

require "http/client"
url = "https://www.youtube.com/watch"
ch = Channel(Nil).new

100.times do
  spawn do
    puts "#{Time.now}: Fetching #{url}..."
    HTTP::Client.get(url)
    ch.send(nil)
  end
end

100.times do
  ch.receive
end

puts "\n\n\nSuccessfully fetched url using HTTP::Client.get(url).\n\n\n"

Example 2:

require "http/client"
url = "https://www.youtube.com/watch"
ch2 = Channel(Nil).new

100.times do
  spawn do
    puts "#{Time.now}: Fetching #{url}..."
    uri = URI.parse(url)
    path = uri.path.try &.empty? ? "/" : uri.path.as(String)
    HTTP::Client.new(uri).get(url) # Doesn't matter if we use .get(path) or .get(url), both will fail
    ch2.send(nil)
  end
end

100.times do
  ch2.receive
end

puts "\n\nSuccessfully fetched url using HTTP::Client.new(uri).get(path)."

Update:
Even simpler example:

require "http/client"
HTTP::Client.get("https://www.youtube.com/watch")

I think we are missing a check at the end of this if, to see if @stream.avail_in is still zero: https://github.com/crystal-lang/crystal/blob/master/src/zlib/inflate.cr#L76-L79

This works for me:

require "http/client"

class Zlib::Inflate
  def read(slice : Slice(UInt8))
    check_open

    while true
      if @stream.avail_in == 0
        @stream.next_in = @buf.to_unsafe
        @stream.avail_in = @input.read(@buf.to_slice).to_u32
        return 0 if @stream.avail_in == 0
      end

      @stream.avail_out = slice.size.to_u32
      @stream.next_out = slice.to_unsafe

      ret = LibZ.inflate(pointerof(@stream), LibZ::Flush::NO_FLUSH)
      read_bytes = slice.size - @stream.avail_out
      case ret
      when LibZ::Error::NEED_DICT,
           LibZ::Error::DATA_ERROR,
           LibZ::Error::MEM_ERROR
        raise Zlib::Error.new(ret, @stream)
      when LibZ::Error::STREAM_END
        return read_bytes
      else
        # LibZ.inflate might not write any data to the output slice because
        # it might need more input. We can know this happened because `ret`
        # is not STREAM_END.
        if read_bytes == 0
          next
        else
          return read_bytes
        end
      end
    end
  end
end

HTTP::Client.get("https://www.youtube.com/watch").to_io(STDOUT)

@Agiley can you check if all the URLs you have work with this change? I believe they should as the check looks correct, but a second check is always good :-)

We can also write a spec for that 🎉

it "should not freeze on empty stream" do
  io = MemoryIO.new("")
  inflate = Inflate.new(io)
  inflate.read_byte.should be_nil
end

which currently freezes...

@asterite: That fix lets the crawler continue on from where it previously got stuck.

It still gets stuck eventually, though.

Reproduce:
$ git clone git://github.com/Agiley/fiber-freeze.git && cd fiber-freeze
$ crystal examples/simple_https_with_fibers.cr

Still trying to pinpoint the exact URL that freezes it. What I can see so far is that some pages are served as "text/html; charset=utf-8" and some as "text/html; charset=windows-1251".

@Agiley You are right.

I'd try to simplify the examples, though. If you run it sequentially it gets stuck on "https://www.gazeta.ru/culture/2016/06/21/a_8319095.shtml", which is easier to reproduce:

require "http/client"

url = "https://www.gazeta.ru/culture/2016/06/21/a_8319095.shtml"
p 1
HTTP::Client.get(url).to_io(STDOUT)
p 2

Yes, my intention was to identify where it got stuck and then build simpler examples (like the one with HTTP::Client.get("https://www.youtube.com/watch")). I had some instances where 1-2 URLs (both from Russian sites) would work in one case and get stuck in another.

Can verify that

require "http/client"
url = "https://www.gazeta.ru/culture/2016/06/21/a_8319095.shtml"
p 1
HTTP::Client.get(url).to_io(STDOUT)
p 2

freezes on my computer as well.

The issue now seems to be that the slice to read is empty, so avail_out is zero, and of course it won't read any bytes. It can be tested, though; I'll push a fix soon. @Agiley Thank you for these tests, they identify real, important bugs :-)
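A spec for this case might look like the following (a sketch in the style of the one above; `deflated` stands in for a valid deflate-compressed payload, which a real spec would have to construct):

it "returns 0 when reading into an empty slice" do
  io = MemoryIO.new(deflated) # `deflated` is a placeholder payload
  inflate = Inflate.new(io)
  inflate.read(Slice(UInt8).new(0)).should eq(0)
end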

Reopening because even though it fixes that problem, it seems to still hang on some other URL...

Ah, no, I still had the patch from the old fix applied. It finishes now for me :-)

In case you want to try, the final patch is:

class Zlib::Inflate
  def read(slice : Slice(UInt8))
    check_open

    return 0 if slice.empty?

    while true
      if @stream.avail_in == 0
        @stream.next_in = @buf.to_unsafe
        @stream.avail_in = @input.read(@buf.to_slice).to_u32
        return 0 if @stream.avail_in == 0
      end

      @stream.avail_out = slice.size.to_u32
      @stream.next_out = slice.to_unsafe

      ret = LibZ.inflate(pointerof(@stream), LibZ::Flush::NO_FLUSH)
      read_bytes = slice.size - @stream.avail_out
      case ret
      when LibZ::Error::NEED_DICT,
           LibZ::Error::DATA_ERROR,
           LibZ::Error::MEM_ERROR
        raise Zlib::Error.new(ret, @stream)
      when LibZ::Error::STREAM_END
        return read_bytes
      else
        # LibZ.inflate might not write any data to the output slice because
        # it might need more input. We can know this happened because `ret`
        # is not STREAM_END.
        if read_bytes == 0
          next
        else
          return read_bytes
        end
      end
    end
  end
end

Ok, I'll let it run through the full set of urls and report back if it gets stuck again.

All right, it went through 12k+ URLs without freezing - great stuff! Thanks for the fixes!
