openssl 🚀 - New EOF detection breaks session resumption

Both peers should always send the close notify (call
SSL_shutdown), but then have 2 options:
1) Wait to receive the peer's close notify
2) Because of the protocol, know that the only thing that can come
now is the close notify and not wait for it.

My understanding is that you try to do 1), but the server is
failing to do it's part and is not sending the close notify.

At the SSL layer, we don't know that the communication is done or
not, so if we don't receice the close notify, but do detect an EOF,
we need to return an error. If we do not return an error, you
might be vulnerable to a truncation attack.

Because you don't know that the other side is going to do 1) or
2), you should always send the close notify.

So I think you now have those options:

Fix the server to send the close notify
Don't wait for the close notify

kroeckx on 22 Mar 2020

It's worth reminding that openssl s_client/s_server do not provide a relevant example. I agree that the broken one in this pair is s_server, but it's worth fixing it.

beldmit on 22 Mar 2020

@beldmit This is not just a s_server problem. We found broken resumption using nginx as well.

@kroeckx While you are right that the server should send a close notify, many servers don't. So "fix the server" is not something we can do. I did a quick test against the Alexa Top 50 and the rate of successful resumptions using bidirectional shutdown went from 70% using 1.1.1d to 30% using 1.1.1e. If SSL_read is going to signal an error, at least don't invalidate the session while doing so.

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

roussosalex on 22 Mar 2020

👍2

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

I did not expect software to ignore that we already returned an
error. Looking at some code, they seem to have special cased that
error and decided to just ignore the error, possibly opening
themselves for a truncation attack.

I think it was actually well known that HTTPS is broken in this
regard, that many servers do not properly close the connection
while they should. The only recommendation I have for that is to
not call SSL_read() when you know that you have already received
everything, or that you ignore it in that case you have received
everything. If you don't know that everything has been received,
and you don't get a close notify, you really should get an error.

I'm currently unsure if we should revert this or not. There
probably is code where because of the protocol it's clear that
everything was send or not, and such code worked properly before
if they had a special case for that error. But all other code
that just ignores it should get fixed instead. As I understand it,
HTTP 1.1 is not always such a protocol. So I guess I'm waiting to
see examples of non-broken code that is affected by the change.

kroeckx on 23 Mar 2020

We can consider revering it in 1.1.1 but keeping it in master/3.0.
But I'm currently still unsure what to do, we want to encourage
people to fix this. And it seems non-trivial for code that
actually knows all data was received and can't avoid calling
SSL_read() to ignore the error.

kroeckx on 23 Mar 2020

👍1

I'm currently unsure if we should revert this or not.

This seems like the best solution for the near future until we can find a better way to handle the EOF issue.

And it seems non-trivial for code that actually knows all data was received and can't avoid calling SSL_read() to ignore the error.

We would love to ignore the error, except that SSL_read now calls SSLfatal, which invalidates the session attached to it and makes it not resumable. We have no direct control over the validity of the session and cannot predict if a server will send a close notify or not.

If this problem persists, we can only switch to unidirectional shutdown and hope that the issue gets fixed in a future release.

roussosalex on 23 Mar 2020

👍1

I'm currently unsure if we should revert this or not.

We effectively changed an error from SSL_ERROR_SYSCALL to SSL_ERROR_SSL. This is correct because our own documentation says to go check errno if you get the former because there's been a system level IO error, or go check the OpenSSL error stack with the latter because there's been some TLS level problem. The problem really is at the TLS level, and errno is 0 in this case so returning SSL_ERROR_SYSCALL is just incorrect behaviour. Since it was a bug fix it met the requirements for backport to 1.1.1

Since we are swapping one type of error for another, one might have expected it to not be too big a deal. Nonetheless, I had a suspicion when I opened the 1.1.1 backport PR this that some code might find this change unexpected. For that reason I requested multiple approvals but still the consensus seemed to be at that time that we should still backport.

As always with bug fixes, one person's bug fix is another person's breaking change, if they were relying on the buggy behaviour. The problem now is having made the decision to backport does it just confuse things further to revert it...only to reintroduce it again in master? I'm also unsure as to the correct answer to this.

mattcaswell on 23 Mar 2020

I think, at first, it should be fixed in openssl code itself - at least s_client/s_server pair should be a relevant example.

beldmit on 23 Mar 2020

A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?

t8m on 23 Mar 2020

I've got some private comments from the Nginx team. They are very unhappy, especially speaking about broken session resumption. If they provide some more details, I'll resend them to the project list.

beldmit on 23 Mar 2020

http://mailman.nginx.org/pipermail/nginx/2020-March/059175.html

beldmit on 23 Mar 2020

A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?

I think we should do that.

kroeckx on 23 Mar 2020

http://mailman.nginx.org/pipermail/nginx/2020-March/059175.html

Note that #11381 talks about that.

kroeckx on 23 Mar 2020

A compromise could be to not break the session resumption if unexpected EOF is obtained

That would fix at least this issue, but the other EOF issues persist.

roussosalex on 23 Mar 2020

I would like to suggest adding an option to SSL_set_options to switch between the old and new EOF handling, which would be set to "old" by default. This would solve all EOF related issues for software relying on the old behavior while keeping the option for anyone concerned about #10880.

roussosalex on 24 Mar 2020

👍2

In practice I'm not sure how many people would actually use that option. I'm wondering whether a better solution is to keep this fix in master, but revert it in 1.1.1 and just document it as a known bug.

mattcaswell on 24 Mar 2020

👍1

I think reverting in 1.1.1 is the best option.

beldmit on 24 Mar 2020

I am afraid that this is the most reasonable way for 1.1.1.

And still in the master it would be worth allowing the resumption after the unexpected EOF unless we have a very good reason not to do that.

t8m on 24 Mar 2020

👍1

If you keep it in 1.1.1 (no comment on that), please add a big CHANGES item that says this is broken and is changing in the next release, and point to a FAQ entry that explains what to do.

richsalz on 24 Mar 2020

👍1

Aren't we out of bits in the SSL_OP_ namespace in 1.1.1 anyway?

kaduk on 24 Mar 2020

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

bagder on 25 Mar 2020

👍1

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

I agree with that. However, if things are breaking all around because some servers have chosen to ignore the previous error code (see what @mattcaswell wrote higher up), the fix may have more current casualty than anyone is comfortable with. In an ideal world, all identified bugs would be fixed and rolled out immediately... unfortunately, that's simply not realistic.

I'm not saying which way we should go, I'm frankly undecided on this, and these bits are not my forte, but I agree with @richsalz that if we decide to revert the change in 1.1.1, that reversal should come with prominent documentation, so people have a chance to see that there is an issue, and time to fix their software until they start tackling 3.0.

levitte on 25 Mar 2020

I think we're all agreeing that it is a bug, and that it should be
fixed. It just that other software doesn't seem to be ready to be
to fix it without breaking too much.

I think we should at least try to find which implementations of
https servers don't do this, and try to get those implementations
fixed.

kroeckx on 25 Mar 2020

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

IMO there is no doubt that this is a bug. However - all bug fixes are behaviour changes. Normally we hope that the behaviour change introduced by a bug fix into a stable branch is desirable because no one wants the old incorrect behaviour. However, every now and then we come across a bug like this one. The old behaviour has been there for so long that other software has been written to expect the incorrect behaviour.

Since 1.1.1 is supposed to be stable, and this has broken stuff, it seems that the correct solution is to revert it. However, 3.0 is a major release. We are trying to keep breaking changes in 3.0 to an absolute minimum, but we do not rule them out entirely. Software authors should reasonably expect to have to test that their software still works when upgrading to 3.0, and might have to make some minor changes in some cases. So, it still seems to me entirely appropriate to fix this bug in that release.

That said this has highlighted a couple of related problems which should be fixed in 3.0:

Session resumption breaking was unexpected - so we should look to fix that
s_server should also be fixed

It's entirely possible that there are other related fixes that we should do to minimise the impact. So it would be good if we could identify those.

mattcaswell on 25 Mar 2020

👍5

Can we at least do the revert in 1.1.1 soon? I can prepare PR for that.

t8m on 25 Mar 2020

👍1

Can we at least do the revert in 1.1.1 soon? I can prepare PR for that.

Please do!

mattcaswell on 25 Mar 2020

Does this imply that we need to do a 1.1.1f soon?

mattcaswell on 25 Mar 2020

❤2

I suppose so.
A related issue to solve in master is also the #11388

t8m on 25 Mar 2020

👍2

PR for the 1.1.1 revert: #11400

t8m on 25 Mar 2020

❤2

I agree that bugfixes are behavior changes, and that 3.0 is a major release. It would be unfortunate if, for example, it wasn't possible to buid and USE a FIPS-capable curl, for example, or set up a FIPS-capable nginx web server.

Please consider these things when you decide if the EOF detection stays in 3.0

richsalz on 25 Mar 2020

👍1

// bidirectional shutdown
if (SSL_shutdown(ssl) == 0)
    SSL_read(ssl, NULL, 0);

Why is SSL_shutdown() being followed by an SSL_read() of zero bytes? The standard way to do a bidirectional shutdown is call SSL_shutdown() again to wait for the peer's close-notify, but that is not necessary, it suffices to for each peer to send their close-notify, there's no need to wait to receive the close notify from the other side, if application-level framing already makes it clear that all data-transmission is done.

vdukhovni on 25 Mar 2020

I agree that bugfixes are behavior changes, and that 3.0 is a major release. It would be unfortunate if, for example, it wasn't possible to buid _and USE_ a FIPS-capable curl, for example, or set up a FIPS-capable nginx web server.

Please consider these things when you decide if the EOF detection stays in 3.0

I think the fix should stay in 3.0. Migration to 3.0 requires at least a rebuild/test of the code, and most applications should continue to work, and those that don't can be remediated to not be oblivious to truncation attacks.

The issue with introducing this into 1.1.1e is that applications tested with previous 1.1.x releases would now without additional build/test cycles be suddenly exposed to a library that presents a potentially unexpected and not entirely uncommon end-of-session behaviour. That's a higher level of risk than for recertification (build/test) of the application for 3.0, likely on a newer overall OS release, ...

Which is not to say that we couldn't discuss providing backwards-compatibility crutches for 3.0, but I'd like to know more about actual, rather than hypothetical impact. Which applications and which corner cases are running into this?

vdukhovni on 25 Mar 2020

👍2

After you've sent a shutdown, the peer can still send data, so you
can still receive data. If the peer did still send data,
SSL_shutdown() will actually generate an error, because you've
lost data. So the recommended way to do it is only call
SSL_shutdown() once and then call SSL_read() until it returns
SSL_ERROR_ZERO_RETURN. So calling SSL_read() with 0 doesn't make any
sense.

kroeckx on 25 Mar 2020

@vdukhovni
We want to be able to resume the session, so following the documentation:

In case the application wants to be able to resume the session, it is recommended to do a complete shutdown procedure (bidirectional close_notify alerts).

Which applications and which corner cases are running into this?

Any application doing bidirectional shutdown (apply attached patched file to s_client) with any affected server, e.g. google.com:

openssl s_client -connect google.com:443 -reconnect -tls1_2

Notice every connection being reported as New instead of Reused with 1.1.1e

s_client_bidirectional.patch.txt

roussosalex on 25 Mar 2020

After you've sent a shutdown, the peer can still send data, so you can still receive data. If the peer did still send data, SSL_shutdown() will actually generate an error, because you've lost data. So the recommended way to do it is only call SSL_shutdown() once and then call SSL_read() until it returns SSL_ERROR_ZERO_RETURN. So calling SSL_read() with 0 doesn't make any sense.

Correct. In my comment, I was tacitly assuming that the application protocol does not support half-close, and that at the point at which SSL_shutdown() was being called, neither side had anything more to send. In which one should call either SSL_shutdown() again, or as you point out keep reading and processing (perhaps discarding in a sufficiently odd corner case) residual data until the peer has nothing further to send. I don't understand what the SSL_read() of zero bytes was supposed to accomplish.

The code in question also does not handle WANT_READ/WANT_WRITE, etc. so I assume the sockets in question are not non-blocking... Otherwise there would need to be additional logic to retry the SSL_shutdown() call after waiting for the socket to drain, ... e.g. https://github.com/vdukhovni/postfix/blob/master/postfix/src/tls/tls_bio_ops.c#L192-L286

vdukhovni on 25 Mar 2020

@sam-github should we land https://github.com/termux/termux-packages/pull/5075 before releasing any versions of node with 1.1.1e?

edit: just noticed that this is not a PR against node.. oops. Should we do something similar in core?

MylesBorins on 25 Mar 2020

@MylesBorins There is no evidence from our test suites that Node.js itself is affected by this. From conversation above, it sounds like correct handling of errors is robust to this change. "could we" be affected? Of course, any change could possibly break someone, somewhere, but we don't have evidence of it yet.

The problem we had in our test suites is that the tests that spawn the openssl s_client/s_server CLIs and interact with them (to make sure we can talk to non-nodejs clients/servers) was affected by a bug in s_client, but that's not worth floating a patch for, s_client is only used internally, we don't ship it.

sam-github on 25 Mar 2020

KK, thanks for the heads up

On Wed, Mar 25, 2020 at 2:30 PM Sam Roberts notifications@github.com
wrote:

@MylesBorins https://github.com/MylesBorins There is no evidence from
our test suites that Node.js itself is affected by this. From conversation
above, it sounds like correct handling of errors is robust to this change.
"could we" be affected? Of course, any change could possibly break someone,
somewhere, but we don't have evidence of it yet.

The problem we had in our test suites is that the tests that spawn the
openssl s_client/s_server CLIs and interact with them (to make sure we can
talk to non-nodejs clients/servers) was affected by a bug in s_client, but
that's not worth floating a patch for, s_client is only used internally, we
don't ship it.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/openssl/openssl/issues/11378#issuecomment-604011000,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AADZYV33RQVW3SXNMPVBOELRJJEVJANCNFSM4LREG3TA
.

MylesBorins on 25 Mar 2020

I think we should at least try to find which implementations of
https servers don't do this, and try to get those implementations
fixed.

I've been looking into this, checking various sites, among others
the alaxa top 50, and only found 4 companies that did not properly
send a close notify: Google, Facebook, Cloudflare, Microsoft. I've
contacted the first 3 of those without much luck so far.

kroeckx on 25 Mar 2020

Session resumption breaking was unexpected - so we should look to fix that

I've run into this before when writing tests -- if you don't call SSL_shutdown() before SSL_free(), your session gets removed from the internal cache. I'm highly confident this behavior was added for security posture reasons, as a potential truncation of the connection could have been the result of an attacker deliberately removing information that would have made it desirable to not use the session in the course of normal operation. An attempt to change the behavior here should be cognizant of that and do some more research into the scenarios involved.

kaduk on 26 Mar 2020

I've been looking into this, checking various sites, among others the alaxa top 50, and only found 4 companies that did not properly send a close notify: Google, Facebook, Cloudflare, Microsoft. I've contacted the first 3 of those without much luck so far.

In the case of HTTP does it matter whether the server sends close-notify? Provided each request receives a complete response, it is not clear that close-notify is needed. What breaks?

vdukhovni on 26 Mar 2020

Because there are HTTP responses where the close of the stream signifies the end of body so we need to be sure it was a clean shutdown.

bagder on 26 Mar 2020

In the case of HTTP does it matter whether the server sends close-notify? Provided each request receives a complete response, it is not clear that close-notify is needed. What breaks?

It should be possible to check it at the HTTP level, but each time
they tried to check it, they had to revert the changes because it
breaks too much. So we are forced to consider HTTP 1.1 to be a
protocol that is vulnerable to a truncation attack.

So instead of checking it at the HTTP level, we can check it at
the TLS level. But there are many servers that don't send the close
notify, and so it's also impossible to check for the truncation
attack at the TLS level.

The current result is that no browser has protection against a
truncation attack.

kroeckx on 26 Mar 2020

@kroeckx That is not quite correct. HTTP/1.1 has a defined terminator for headers (newline) and header blocks (blank line), and three ways to terminate response bodies.

Clients should not process unterminated header blocks. (Note there is some complication here because HTTP/0.9 is somehow not dead yet.) I believe Chrome enforces this over HTTPS (ERR_RESPONSE_HEADERS_TRUNCATED). There are then three ways to terminate response bodies:

If the response has Transfer-Encoding: chunked, each chunk has a length prefix and the whole thing is terminated by an empty chunk. The HTTP client should only report a clean end of stream if the length prefixes are counted up and there is an empty chunk. I believe Chrome enforces this (ERR_INCOMPLETE_CHUNKED_ENCODING).
If the response has a Content-Length header, that is the number of bytes to read. The HTTP client must count that many bytes before reporting a clean end of stream. I believe Chrome enforces this (ERR_CONTENT_LENGTH_MISMATCH).
If neither (1) nor (2) apply, the response is terminated by EOF.

Of these, (3) is the only one that depends on close_notify. Unfortunately, close_notify is a fiction so EOF-terminated response bodies are simply unsafe and HTTP/1.1 servers need to ensure they always include Content-Length if the length is known, and use chunked encoding if the length is unknown. (Or deploy HTTP/2 and then you're fine. Happily ALPN is downgrade-protected.) Note that, for responses that use (1) or (2), whether the server sends close_notify is irrelevant because the client should be rejecting an early termination regardless.

davidben on 26 Mar 2020

Of these, (3) is the only one that depends on close_notify. Unfortunately, close_notify is a fiction so EOF-terminated response bodies are simply unsafe and HTTP/1.1 servers need to ensure they always include Content-Length if the length is known, and use chunked encoding if the length is unknown. (Or deploy HTTP/2 and then you're fine. Happily ALPN is downgrade-protected.) Note that, for responses that use (1) or (2), whether the server sends close_notify is irrelevant because the client should be rejecting an early termination regardless.

Thanks for the detailed response, this is what I was trying to hint at, lack of close-notify should only affect HTTP/1.0 servers that stream data and can't provide a Content-Length, or poorly-implemented HTTP/1.1 servers that don't use chunked transfer encoding when they should.

I don't know how common HTTP < 1.1 is these days, nor how common it is for 1.1 servers to skimp to not do chunked encoding as required.

vdukhovni on 26 Mar 2020

The "poorly-implemented HTTP/1.1 servers" are still out there and are being used. How common? Impossible to say.

bagder on 26 Mar 2020

👍1

I was tacitly assuming that the application protocol does not support half-close, and that at the point at which SSL_shutdown() was being called, neither side had anything more to send. In which one should call either SSL_shutdown() again, or as you point out keep reading and processing (perhaps discarding in a sufficiently odd corner case) residual data until the peer has nothing further to send. I don't understand what the SSL_read() of zero bytes was supposed to accomplish.

People are following the documentation for return value 0 of SSL_shutdown().

Current OpenSSL documents:

Call SSL_read() to do a bidirectional shutdown.

OpenSSL before commit 00f561ab9c7 versions documents:

Call SSL_shutdown() again to do a bidirectional shutdown.

Both have have been part of the documentation of stable OpenSSL releases, so users can expect either of these options to work.

If ABI breaks happen by a change breaking past documented behaviour, please add a new version with a different name instead (e.g. SSL_shutdown2). Silently changing documented behaviour breaks users of OpenSSL, in the worst case silent breakage introducing security vulnerabilities.

Who at OpenSSL is responsible for ensuring that current implementation still matches the behaviour in the documentation of older OpenSSL releases?

AdrianBunk on 12 Apr 2020

I think there is a misunderstanding here. The problem is when
the other side of the connection does not call SSL_shutdown(). We
have always generated an error an error for that, but 1.1.1e
generated a different error.

kroeckx on 12 Apr 2020

I think there is a misunderstanding here. The problem is when the other side of the connection does not call SSL_shutdown().

You wrote earlier in this discussion:

The only recommendation I have for that is to not call SSL_read() when you know that you have already received everything, or that you ignore it in that case you have received everything.

You changed the documentation of SSL_shutdown() to read in current stable OpenSSL releases:

Call SSL_read() to do a bidirectional shutdown.

The problem is not the other side of the connection, the problem is that you are breaking working applications that have been coded following your documentation.

We have always generated an error an error for that, but 1.1.1e generated a different error.

If this breaks behaviour documented anywhere in the documentation of any released OpenSSL version, then this might require a renaming of the changed function. Functions changing documented behaviour can be nasty in many different ways when upgrading OpenSSL.

AdrianBunk on 12 Apr 2020

Can we stop having this discussion in 2 places at the same time?

kroeckx on 12 Apr 2020

Can we stop having this discussion in 2 places at the same time?

I'll answer to your email here.

The fix is to not call SSL_read() or SSL_shutdown() when you don't need to do it. This was always the case. But some people just decided to ignore the error, and have a problem when OpenSSL gave a different error.

How does this match the official OpenSSL documentation shipped in current and past stable OpenSSL releases?

Current documentation instructs to use a bidirectional shutdown with SSL_shutdown() for resuming a session.

Documentation was recently changed to tell people to call SSL_read() for a bidirectional shutdown if SSL_shutdown() returned 0. The documentation also instructs to ignore SSL_ERROR_SYSCALL in that case.

AdrianBunk on 12 Apr 2020

The fix is to not call SSL_read() or SSL_shutdown() when you don't need to do it. This was always the case. But some people just decided to ignore the error, and have a problem when OpenSSL gave a different error.

How does this match the official OpenSSL documentation shipped in current and past stable OpenSSL releases?

Like I said in the mail, documentation can always be improved. I
don't see the current documentation say that a shutdown is
required, but can understand that it might not be clear.

Current documentation instructs to use a bidirectional shutdown with SSL_shutdown() for resuming a session.

That's not what it says, that's for the special case of a client
that only sends. If it's using TLS 1.3, and it wants to resume,
it should read.

Documentation was recently changed to tell people to call SSL_read() for a bidirectional shutdown if SSL_shutdown() returned 0. The documentation also instructs to ignore SSL_ERROR_SYSCALL in that case.

It's still possible to use SSL_shutdown() a 2nd time. However, if
you did not read all the data, SSL_shutdown() will return an
error. If you use SSL_read() instead, you're sure that you do read
all the data.

I think there is also a misunderstanding of what that
old text about SSL_ERROR_SYSCALL meant. If SSL_shutdown()
returns 0, there is no error. If you call SSL_get_error() when
there is no error, it will always return SSL_ERROR_SYSCALL.
The text said to not call SSL_get_error().

kroeckx on 12 Apr 2020

I think there is also a misunderstanding of what that
old text about SSL_ERROR_SYSCALL meant. If SSL_shutdown()
returns 0, there is no error. If you call SSL_get_error() when
there is no error, it will always return SSL_ERROR_SYSCALL.
The text said to not call SSL_get_error().

I should probably clarify that applications that have a problem with
the new error, and are using SSL_shutdown() 2 times, it's the 2nd
SSL_shutdown() that fails, where SSL_shutdown() returns -1 both with
the old and the new error.

kroeckx on 12 Apr 2020

That's not what it says, that's for the special case of a client
that only sends. If it's using TLS 1.3, and it wants to resume,
it should read.

I guess the documentation can be adjusted to actually say that the
client needs to wait for the close notify of the server, there
might be cases where the client does not need to send a close
notify itself.

kroeckx on 12 Apr 2020

I was tacitly assuming that the application protocol does not support half-close, and that at the point at which SSL_shutdown() was being called, neither side had anything more to send. In which one should call either SSL_shutdown() again, or as you point out keep reading and processing (perhaps discarding in a sufficiently odd corner case) residual data until the peer has nothing further to send. I don't understand what the SSL_read() of zero bytes was supposed to accomplish.

People are following the documentation for return value 0 of SSL_shutdown().

The documentation says:

       ... SSL_read() should be called until all data is received.
       SSL_read() will indicate the end of the peer data by returning <= 0 and
       SSL_get_error() returning SSL_ERROR_ZERO_RETURN.

Well, a single SSL_read() of zero bytes is not the right way to call SSL_read() until all data is received. To do that, you need a loop, that reads more than zero bytes on each iteration.

Reading zero bytes is AFAIK only correct when you're sure (based on previous application traffic or implicit application protocol properties) that the other side has nothing more to send, and are just using SSL_read() to consume the close notify (and perhaps any session tickets from the server if nothing has been read at all so far). But in that case, calling SSL_shutdown() again is just as effective. And if any tickets are already received, then there's no need for the read at all, you don't have to receive the server's close notify if that's all the server can legitimately send.

In a more general (potential TLS-level half-close) context, the SSL_read() needs to be a read-and-discard loop.

vdukhovni on 13 Apr 2020

I was tacitly assuming that the application protocol does not support half-close, and that at the point at which SSL_shutdown() was being called, neither side had anything more to send. In which one should call either SSL_shutdown() again, or as you point out keep reading and processing (perhaps discarding in a sufficiently odd corner case) residual data until the peer has nothing further to send. I don't understand what the SSL_read() of zero bytes was supposed to accomplish.

People are following the documentation for return value 0 of SSL_shutdown().

The documentation says:
       ... SSL_read() should be called until all data is received.
       SSL_read() will indicate the end of the peer data by returning <= 0 and
       SSL_get_error() returning SSL_ERROR_ZERO_RETURN.
Well, a single SSL_read() of zero bytes is not the right way to call SSL_read() until all data is received. To do that, you need a loop, that reads more than zero bytes on each iteration.

Reading zero bytes is AFAIK only correct when you're sure (based on previous application traffic or implicit application protocol properties) that the other side has nothing more to send, and are just using SSL_read() to consume the close notify (and perhaps any session tickets from the server if nothing has been read at all so far). But in that case, calling SSL_shutdown() again is just as effective. And if any tickets are already received, then there's no need for the read at all, you don't have to receive the server's close notify if that's all the server can legitimately send.

In a more general (potential TLS-level half-close) context, the SSL_read() needs to be a read-and-discard loop.

This misses the problem, the problem is not the size of the read in code for demonstrating the bug.

Relevant is the following part of the documentation:

       SSL_shutdown() tries to send the close_notify shutdown alert to the
       peer.  Whether the operation succeeds or not, the SSL_SENT_SHUTDOWN
       flag is set and a currently open session is considered closed and good
       and will be kept in the session cache for further reuse.

Based on what is documented, it would be a bug in OpenSSL if any future version of OpenSSL would destroy the session.

AdrianBunk on 13 Apr 2020

On Sun, Apr 12, 2020 at 11:28:08PM -0700, Adrian Bunk wrote:

This misses the problem, the problem is not the size of the read in code for demonstrating the bug.

Relevant is the following part of the documentation:
       SSL_shutdown() tries to send the close_notify shutdown alert to the
       peer.  Whether the operation succeeds or not, the SSL_SENT_SHUTDOWN
       flag is set and a currently open session is considered closed and good
       and will be kept in the session cache for further reuse.

That assumes you don't do any other calls anymore. Clearly if you
do an other call, it's not closed anymore.

kroeckx on 13 Apr 2020

On Sun, Apr 12, 2020 at 11:28:08PM -0700, Adrian Bunk wrote: This misses the problem, the problem is not the size of the read in code for demonstrating the bug. Relevant is the following part of the documentation: SSL_shutdown() tries to send the close_notify shutdown alert to the peer. Whether the operation succeeds or not, the SSL_SENT_SHUTDOWN flag is set and a currently open session is considered closed and good and will be kept in the session cache for further reuse.
That assumes you don't do any other calls anymore. Clearly if you do an other call, it's not closed anymore.

What you call _clearly_ is not what you documented in commit 00f561ab9c70dec128467fb2b4f3eb952829c4c4:

    The shutdown is not yet finished: the close_notify was sent but the peer
    did not send it back yet.
    Call SSL_read() to do a bidirectional shutdown.

Existing software has been written based on your documentation, not based on what is in your mind.

And no, changing the documentation would not solve the problem that existing software relies on what is both the currently implemented behaviour and is promised in the documentation of current and past stable OpenSSL releases.

AdrianBunk on 13 Apr 2020

Reading https://tools.ietf.org/html/rfc8446#section-6.1 again, it says:

Each party MUST send a "close_notify" alert before closing its write side of the connection

kroeckx on 13 Apr 2020

You seem to be an expert in reading things in the documentation that it doesn't say.

kroeckx on 13 Apr 2020

You seem to be an expert in reading things in the documentation that it doesn't say.

Please ask people from the target audience of the documentation, programmers who are not OpenSSL developers and are not TLS experts, what it says.

I am claiming that SSL_shutdown(3ssl) says the following:

When SSL_shutdown() returns 0 you must call SSL_read().
Following this instruction is not documented to invalidate anything promised earlier in the same manpage.

AdrianBunk on 13 Apr 2020

And I have yet to see which part of the documentation is actually wrong. I'm not saying it's perfect, everything can be improved, but I currently don't see it stating anything that's not correct. And even if it does, I don't see why we shouldn't treat this just like any other bug and fix it.

kroeckx on 13 Apr 2020

So I've just created #11531

kroeckx on 13 Apr 2020

And I have yet to see which part of the documentation is actually wrong. I'm not saying it's perfect, everything can be improved, but I currently don't see it stating anything that's not correct. And even if it does, I don't see why we shouldn't treat this just like any other bug and fix it.

The current documentation is not wrong, it matches the implementation.

But changes that would make future OpenSSL versions no longer compliant with the current documentation would be very wrong.

Changing the documentation and then changing OpenSSL to match the changed documentation would screw all users who implemented their code based on what you wrote in the current documentation.

AdrianBunk on 13 Apr 2020

We don't have any plans to change it in such a way. The only thing we plan on doing is:

Changing an error code
Properly marking a session with an error as non reusable.

kroeckx on 13 Apr 2020

We don't have any plans to change it in such a way. The only thing we plan on doing is:
* Changing an error code

* Properly marking a session with an error as non reusable.

From a user point of view this would break working code that is correct according to the OpenSSL documentation at the time when it was written.

AdrianBunk on 13 Apr 2020

Adrian,

I don't see how the code was correct. I don't see how ignoring an
error can ever be correct. It has always been documented as a
"non-recoverable, fatal error". Both the old and the new error
say that.

If you don't have anything useful to contribute, I'm going to stop
replying.

kroeckx on 13 Apr 2020

Your documentation instructs the user to ignore SSL_ERROR_SYSCALL.

As I've already explained, it doesn't, it never has.

Most OpenSSL functions return <= 0 for error. If SSL_shutdown()
returns 0 there is no error, so you shouldn't call
SSL_get_error().

You need to pass the return value to SSL_get_error(), so you would
call SSL_get_error(ssl, 0). So you're telling SSL_get_error() that
some function did in fact say there was an error. But
SSL_get_error() doesn't know about any error, in which case it
returns SSL_ERROR_SYSCALL.

kroeckx on 13 Apr 2020

I don't see how ignoring an error can ever be correct.

The root problem is that you (and many others here) are focussing too much on the protocol side of OpenSSL. There is also a user API/ABI side, and most users of your API do not care at all about the details of what OpenSSL does. They follow your documentation, and expect that what is documented and tested to work today will continue to work with future OpenSSL versions.

You are not designing a shiny new library from scratch, you are maintaining an existing library with many thousands of users. Any change in behaviour does cause pain to users.

It is a problem that OpenSSL seems to lack a culture of trying to minimize the amount of user visible changes.

It has always been documented as a "non-recoverable, fatal error".

It has also been documented that the current session will always stay valid after SSL_shutdown(), and that the user must call SSL_read() if SSL_shutdown() returns 0.

It is not the fault of the user when your documentation is contractionary.

AdrianBunk on 13 Apr 2020

Repeating false claims about what the documentation says is not helpful.

kroeckx on 13 Apr 2020

I think a couple of things are being conflated here, leading to some confusion.

I think we can all agree that unexpected EOF _should_ be a fatal error in an ideal world, because RFC 8446 is very clear about the close_notify requirement.
The documentation is accurate if the server properly closes the TLS connection.

The main problem is how misbehaving servers are handled. If they are treated as a fatal SSL error, then the session should be invalidated according to the RFC:

Servers and clients MUST forget the secret values and keys established in failed
connections, with the exception of the PSKs associated with session
tickets, which SHOULD be discarded if possible.

This would unfortunately turn all connections to currently non-conforming servers into failed connections, breaking session resumption and causing lots of errors. Assuming that the server applications cannot be fixed in the near future, there is no perfect solution to the problem but changing this behavior in a major release instead seems reasonable to me.

roussosalex on 13 Apr 2020

👍1

I think a couple of things are being conflated here, leading to some confusion.

I think we can all agree that unexpected EOF _should_ be a fatal error in an ideal world, because RFC 8446 is very clear about the close_notify requirement.

Did this requirement appear in 8446 or it was present for earlier specifications too?

beldmit on 13 Apr 2020

All relevant TLS standards (1.2, 1.1 and 1.0) state:

The other party MUST respond with a close_notify
alert of its own and close down the connection immediately,
discarding any pending writes.

roussosalex on 13 Apr 2020

This would unfortunately turn all connections to currently non-conforming servers into failed connections, breaking session resumption and causing lots of errors. Assuming that the server applications cannot be fixed in the near future

Having tried to contact some of the HTTPS server operators, it's
unlikely that they will ever fix it, and it seems some HTTPS
clients also do not send the close notify. Since it seems that it's
really only some of the big players that violate it, I thought it
would be easy to get them to fix it, but it seems that's not the
case.

The only solution I see is that if you need to talk HTTPS, you don't
wait for the close notify.

If you're using something other than HTTPS, and the peer is also
not sending close notify, you can either try to get it fixed, or
not wait for it. And I think the only solution that's really going
to work is that you stop waiting for it.

And I think now is the time to fix it.

kroeckx on 13 Apr 2020

It is a problem that OpenSSL seems to lack a culture of trying to minimize the amount of user visible changes.

I think this claim is ill-informed; the culture is to minimize the amount of user-visible changes. We're all human, though, and sometimes things slip through. As 1.1.1f has illustrated, we are capable of reacting and fixing it when the wrong thing happened.

kaduk on 15 Apr 2020

👍1

Having tried to contact some of the HTTPS server operators, it's
unlikely that they will ever fix it

Unfortunate, but expected.

the only solution that's really going
to work is that you stop waiting for it.

Alright, then the SSL_shutdown documentation should be updated to reflect that in 3.0. I still think there should be an option to ignore EOF for protocols that already handle truncation well.

roussosalex on 15 Apr 2020

Alright, then the SSL_shutdown documentation should be updated to reflect that in 3.0.

I did that in #11531

I still think there should be an option to ignore EOF for protocols that already handle truncation well.

I think that's a good suggestion. It's probably easier for
applications to use that than to avoid the SSL_read() call.
We should consider adding that in 1.1.1-stable.

kroeckx on 15 Apr 2020

👍1

There's already SSL_CTX_set_quiet_shutdown.

davidben on 15 Apr 2020

There's already SSL_CTX_set_quiet_shutdown.

That prevents sending the close_notify, which is one way to cause
the unexpected EOF.

I think some applications might want a mode where on EOF, we just
return that we have an EOF. SSL_read() can not indicate an EOF. I
think we would need something like SSL_ERROR_EOF, I don't think
reusing SSL_ERROR_ZERO_RETURN for that is a good idea.

kroeckx on 15 Apr 2020

I do not think we need to change 1.1.1 unless we introduce the "ignore EOF without close notify" toggle in master. In that case it would be nice to have it backported in 1.1.1 as well to allow applications to use it with 1.1.1 already so they do not need to handle it differently than for 3.0.

t8m on 16 Apr 2020

Openssl: New EOF detection breaks session resumption

Most helpful comment

All 80 comments

Related issues