Go: net, os: Set*Deadline() expiration error should be unique, as .Timeout() is true for keepalive, etc

Created on 13 Apr 2019  Â·  59Comments  Â·  Source: golang/go

This is a bug. A keepalive error is a connection failure, not a deadline event.

net Docs say:

A zero value for [deadline] means I/O operations will not time out.

A zero value for [read deadline] means Read will not time out.

KeepAlive specifies the keep-alive period for an active network connection.
If zero, keep-alives are enabled if supported by the protocol and operating system.
Network protocols or operating systems that do not support keep-alives ignore this field.
If negative, keep-alives are disabled.

For a TLS connection that's been severed, Conn.Read() returns a net.Error with .Timeout()==true due to KeepAlive failure. (Go 1.12, Linux, amd64)

The Error should give .Timeout()==false to comply with the docs. Code that checks for .Timeout()==true would generally assume that an explicit deadline had passed.

The .Error() string should mention keepalive. It's currently:
"read tcp n.n.n.n:p->n.n.n.n:p: read: connection timed out"

Related: net.Dialer.KeepAlive gets 15*time.Second if unset/zero. This isn't documented in package net.

cc @bradfitz

NeedsDecision release-blocker

Most helpful comment

ETIMEDOUT.Timeout() has been true for _many_ releases. It seems quite clear we cannot redefine that now.

I was sympathetic to CL 188657 until I searched the Linux kernel sources for ETIMEDOUT, as Filippo mentioned above. It gets returned from an enormous number of places, most of them having nothing to do with TCP, and many of them having nothing to do with networking. So translating ETIMEDOUT to ErrKeepAlive is clearly incorrect in the overwhelming majority of places where it appears in the kernel source. I don't believe we can be more precise here without a tremendous amount of system-specific code.

Keepalives have been in previous releases and turned into errors with err.Timeout() == true. I realize they are a little more common now, but even so it does not seem terrible to continue the behavior of the past 13 releases into a 14th.

I suggest we leave this alone for Go 1.13.

For Go 1.14 maybe the answer is to change our net.Conn implementations to return an error that Is(context.DeadlineExceeded), to allow a more precise check than Timeout.

All 59 comments

Also filed #31490 to report that tls.DialWithDialer() doesn't respect .KeepAlive.

Since 1.11, net.Dialer.KeepAlive gets 15*time.Second if unset/zero. This isn't documented in package net.

It is documented that we enable keep-alives if that field is unset/zero. The choice of not specifying 15s in the docs was deliberate and is actually documented in the commit message: https://github.com/golang/go/commit/5bd7e9c54f946eec95d32762e7e9e1222504bfc1

I believe this is a new bug in 1.12, so a fix should be backported to a 1.12.x release.
Also reported in #32735

cc @ianlancetaylor
@gopherbot add release-blocker

Do you know what makes this new in 1.12? I'm not seeing it.

This decided on keepalive by default in 1.12: #23459. That causes deadline handlers in existing code to see non-deadline (i.e. keepalive) errors.

If I understand you correctly, you are saying that the bug has existed for a long time for programs that enable TCP keepalive by setting the KeepAlive field in net.Dialer, but that it is more likely to occur in 1.12 because now that field is set by default. Is that correct?

Yes. It's probably rare to use both deadlines and explicit keepalive, so the bug wasn't reported.

Now any code with long deadlines (relatively common) will wrongly detect deadline events due to implicit keepalive.

Go 1.13 also enables Keep-Alives by default on the net.Listen side (1abf3aa55bb8b346bb1575ac8db5022f215df65a), so this might be worth fixing now, before exposing a new wave of applications to it.

@costela, your keepalive patch will trigger this bug...

@gopherbot please open backport 1.12

Backport issue(s) opened: #33137 (for 1.12).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

The Error should give .Timeout()==false to comply with the docs. Code that checks for .Timeout()==true would generally assume that an explicit deadline had passed.

I can't seem to find anything in the docs saying that a keepalive timeout _should not_ set .Timeout()==true. IMHO this is not necessarily obvious and should be clarified in the docs if it's indeed intended.

This isn't to say the new default behavior wouldn't introduce a potentially breaking change, but I'm not sure changing the error to be .Timeout()==false is the right approach. Just as there might be code depending on .Timeout()==true for detecting deadlines, there might be code depending on the same behavior for explicitly set keepalives. Or am I missing something obvious?

Docs say "A zero value for [deadline] means I/O operations will not time out."

Don't worry about it; it's been assigned. Sorry to bother you.

Deadlines and keep-alive errors are deeply different: the former are fully recoverable, the latter aren't, for example. It seems unlikely any code would ever want to handle them the same way.

Keep-alives are more akin to the connection dropping, so I don't think marking them as Timeouts makes sense. I'll make this change tomorrow.

@FiloSottile, let's make the .Error() string mention keep-alive. It's currently:
"read tcp n.n.n.n:p->n.n.n.n:p: read: connection timed out"

Also I opened a 1.12.x backport issue.

I looked into this, and as far as I can tell, there is no way to single out keep-alive errors: they just surface a ETIMEDOUT from the read().

What we might do, is isolate deadline errors, which as far as I can tell (although I am no expert on that part of the code) are handled by runtime timers, and make those more uniquely recognizable. For example, by making them an exported error type.

We might even decide to make only those have Timeout() true. Currently Timeout() is true for e == EAGAIN || e == EWOULDBLOCK || e == ETIMEDOUT and for some DNS resolution errors. That line has last been touched in 2011, and I don't feel confident about how each of those ids map to a timeout.

If I'm right, this is not a small fix to a specific error value anymore, and I don't think it's something we should ship at the very end of the freeze. /cc @golang/osp-team

I believe that if a read from a network connection fails due to a deadline, the read will return internal/poll.ErrTimeout. I believe that if a read fails due to a keep-alive error, the read will return syscall.ETIMEDOUT.

Ian, can we make net.OpError.Timeout() check internal/poll.ErrTimeout?

Altho we could undo #23378 for 1.13, we still need to fix this in 1.12.x & 1.13 due to #23459.

Seems a fair bit simpler to change internal/poll.ErrTimeout to not return true for the Timeout method. Or just not implement the Timeout method. Might be a good idea to change the name, of course.

People commonly test for a deadline event via err.Timeout() == true; that has to keep working. It should be false in all other circumstances.

Oh, right, I got mixed up.

What we want is for syscall.ETIMEDOUT to not return true for the Timeout method. As @FiloSottile said. So the question is really whether we can make that change for 1.13 at this point in the release process. That is a hard call.

Well I don't think we should revert both of the default-keep-alive commits (and back-port one revert to 1.12), but that's the logical alternative to fixing this now.

I feel like for this change to be complete, we should make syscall.Errno.Timeout() always return false, and make internal/poll.ErrTimeout the only error with Timeout true. That way the method can be used reliably to detect when deadlines are exceeded, which is the only practical use case for it that I can think of. (Said another way, you might want to handle differently deadline errors, as they can be reset, but AFAIK no others.)

By the way, I noticed that keep-alive errors are also marked Temporary, which is definitely not the case, and we should fix that. (Although I admit I don't entirely grok what Temporary means, as internal/poll.ErrTimeout is also Temporary but surely retrying the operation without resetting the deadline would not work, and crypto/tls.timeoutError has Temporary even if a number of permanent reasons can cause a dial to timeout.)

https://github.com/golang/go/blob/b41eee4c8a2fe692c1d9fb46972b9047b5dc02b7/src/syscall/syscall_unix.go#L138-L144

.Temporary() is also slated to be fixed in 1.13 #32463

32463 is about os.ErrTemporary, not net.Error.Temporary, but it does indeed come to a similar conclusion about its undefinedness.

(A further complication is that I remain unconvinced that "Temporary" is even well defined as a concept—I cannot actually explain precisely what it means for an error to be temporary—but my comments above apply even if we assume it is well-defined as a property. The problem is that errors aren't properties, and so errors.Is probably isn't appropriate for testing properties.)

There's also os.{Syscall,Path}Error.Timeout() -- should they remain unaffected by this fix?

There's also os.{Syscall,Path}Error.Timeout() -- should they remain unaffected by this fix?

At this point in the freeze, certainly. We are only considering ETIMEDOUT because the new default server keep-alive will make it surface much more often.

I am growing convinced though that net.Error failed at capturing meaningful semantics, and we'll get a new try with the new error inspection mechanism, which should learn from this. (See https://github.com/golang/go/issues/32463#issuecomment-514742149.)

For this, how about we just extend the error messages with (deadline exceeded) and (keep-alive)?

just extend the error messages with (deadline exceeded) and (keep-alive)

Meaning add a new net "API" within the .Error() string? :-/

People currently detect deadline events with err.Timeout() == true, so we need to keep that code working in the presence of default keep-alive (syscall.Errno.Timeout() should return false). If we can't do that now, we need to revert both default-keep-alive commits, and back-port a revert to 1.12.x.

I do think amending the .Error() result for keep-alive errors would help; I listed that in the issue todos.

os.PathError.Timeout() should also only check internal/poll.ErrTimeout, triggered by os.File.SetDeadline().

os.SyscallError.Timeout() might be a mistake, but we can give it the current syscall.Errno.Timeout() logic so the change to the latter doesn't break the former.

@ianlancetaylor what do you think of the plan above, touching syscall.Errno and os.SyscallError?

The goal is to make net.Error and os.PathError only yield .Timeout() == true for internal/poll.ErrTimeout (i.e. deadline events), and not affect users of os.SyscallError.

That seems workable for 1.13.

First, we are only looking at keep-alive errors now, because they are the only ones tied to a Go 1.13 behavior change. We are not touching anything else months into the freeze.

Second, I am growing unconvinced we can/should fix this.

  1. We don't have any tests for it, nor are we equipped to write them. To test keep-alives we'd have to force a system-side state change, or start dropping packets at the network level.

    • socktest.Filter doesn't look like it would actually test the OS behavior.
    • Existing keep-alive tests just use testHookSetKeepAlive to check setKeepAlive[Period] is called.
  2. ETIMEDOUT is just as unspecified as our net.Error.Timeout(). I couldn't even find any docs that guarantee a keep-alive error causes ETIMEDOUT, and on the other hand there are apparently hundreds of non-keep-alive conditions that cause ETIMEDOUT (from skimming the Linux kernel sources).

The simplest change, removing ETIMEDOUT from syscall.Errno.Timeout(), caused net.TestDialTimeout to fail because it expects a Timeout() == true error from a Dial that ends in ETIMEDOUT. I have no idea what else would break, because I assume the remaining behaviors are just as untestable and hence untested, and the RC is too late to find out by letting it out into the world.

I suppose we could special-case ETIMEDOUT from read or write, and/or only from a TCP connection, but at this point I feel like I am just stabbing in the dark, without tests, on the week of the RC.

From another perspective, though, keep-alives always returned Timeout() == true errors, and we enabled them by default on clients in Go 1.12 and no one noticed. After looking more into the semantics of Timeout(), I think they are unfortunately meaningless, and we should just leave things as is instead of rocking the boat, try to do better with the new errors (see #33411), and eventually deprecate net.Error, os.IsTimeout, and the underspecified Timeout methods.

/cc @golang/osp-team @ianlancetaylor

Two issues were filed re spurious deadline events in 1.12. Such bugs can be hard to detect unless you check the elapsed time for deadlines! A lot more folks would notice them if they start appearing in server-side apps in 1.13.

Let's revert both keep-alive commits for 1.13, and back-port a revert to 1.12.

It's reasonable to ask users to explicitly enable keep-alive if required, and many of us already use deadlines as a context-sensitive keep-alive technique.

@networkimprov Can you point us to the two bugs filed against 1.12? I'm not excited about changing the 1.12 behavior at this point.

I note that it would be feasible to handle ETIMEDOUT errors from Read and Write specifically. It would mean changing a few functions in internal/poll/fd_unix.go to look for ETIMEDOUT and replace it with some other error. But it does seem risky to do that for 1.13 at this point.

32735 and this one (I encountered spurious deadlines in an app with long-running links to servers.)

This bug won't stop the app, but does change the way it was intended to work, with rather variable consequences.

Change https://golang.org/cl/188657 mentions this issue: internal/poll: if Read/Write get ETIMEDOUT, return ErrKeepAlive

For the record, https://golang.org/cl/188657 shows the change to not return ETIMEDOUT from a read or write system call. It's probably not a good idea to do this for 1.13 at this point.

ETIMEDOUT.Timeout() has been true for _many_ releases. It seems quite clear we cannot redefine that now.

I was sympathetic to CL 188657 until I searched the Linux kernel sources for ETIMEDOUT, as Filippo mentioned above. It gets returned from an enormous number of places, most of them having nothing to do with TCP, and many of them having nothing to do with networking. So translating ETIMEDOUT to ErrKeepAlive is clearly incorrect in the overwhelming majority of places where it appears in the kernel source. I don't believe we can be more precise here without a tremendous amount of system-specific code.

Keepalives have been in previous releases and turned into errors with err.Timeout() == true. I realize they are a little more common now, but even so it does not seem terrible to continue the behavior of the past 13 releases into a 14th.

I suggest we leave this alone for Go 1.13.

For Go 1.14 maybe the answer is to change our net.Conn implementations to return an error that Is(context.DeadlineExceeded), to allow a more precise check than Timeout.

Thanks, moving to 1.14.

Should I file an issue to revert #23378 and #23459 for 1.13?

In my opinion we should cover the issue in the release notes. I think many more people will be helped by turning on keep-alive by default than will be hurt by the confusion around the Timeout method.

A note like: "Code using SetDeadline() should generally disable keep-alive, which is now on by default for all TCP connections, both dialed and accepted" ...?

After many years of keep-alives off by default, I don't understand the urgency for default-on before these errors can be discerned from deadline events.

Because keep-alive is already on by default for 1.12, and people are using 1.12. Rolling back and then forward again is also confusing.

Change https://golang.org/cl/188819 mentions this issue: doc/go1.13: mention confusion between keep-alive timeout and deadline

The text added to the release notes should also live in the docs, and be referenced by net.Dial() & .Listen(), net.Error.Timeout(), and net.Conn.Set*Deadline().

The docs are currently explicit that only deadlines cause "time out". See quote in issue text.

Change https://golang.org/cl/189757 mentions this issue: net: document that a keep-alive failure also returns a timeout

I sent a CL to update the docs for SetDeadline. That seems to be the relevant place. This kind of detail would be out of place in the other locations mentioned.

Back in https://github.com/golang/go/issues/31449#issuecomment-517732888, I wrote:

For Go 1.14 maybe the answer is to change our net.Conn implementations to return an error that Is(context.DeadlineExceeded), to allow a more precise check than Timeout.

We did not get into this, and we are close to the freeze. It seems like we should probably put this off to next cycle. But I will retitle so it is easier to understand what this is about.

Above @rsc suggests that we change net.Conn.Read/Write to return context.DeadlineExceeded if they time out due to exceeding a deadline. That will change the error when printed from i/o timeout to context deadline exceeded. I don't think that would be an ideal change, as I think it is confusing to refer to a context when no context is involved.

I suggest that we instead add net.ErrDeadlineExceeded and os.ErrDeadlineExceeded and return those.

Actually, let's just add os.ErrDeadlineExceeded.

Maybe _go vet_ should flag use of err.Timeout() after Set*Deadline(), since that is not a reliable way to check for deadline-expired.

Why is this error related to package os? Aren't deadlines a net.Conn concept?

EDIT: I forgot that deadlines also exist for os.File.

Change https://golang.org/cl/228644 mentions this issue: net: return context.DeadlineExceeded if past context deadline

Any change to "go vet" can be a separate issue. Although I suspect that that change is not feasible as it would make some amount of existing code non-vet-compliant, which we can't really do.

The os package has SetDeadline just as the net package does.

Could net.ErrDeadlineExceeded be added as an alias to os.ErrDeadlineExceeded?

Filed #38508 for the vet issue.

Sure, we could have both net.ErrDeadlineExceeded and os.ErrDeadlineExceeded, but I don't see a reason to do that. The net package already depends on the os package. I'm open to it if there is a good reason for it.

It avoids the need to import "os" when calling net.Conn.Set*Deadline(). Your first instinct was to provide it :-)

Almost all code using Set*Deadline() now needs to be updated. It's odd that those changes should require an import unless "os" is already imported.

My guess is that very little code that uses SetDeadline will need to be updated. The only code that needs to be updated is code that needs to reliably determine whether the connection failed due to an exceeded deadline. Most programs will just see that the connection failed and carry on.

I would like to hear someone else's opinion on this question.

Change https://golang.org/cl/228645 mentions this issue: os, net: define and use os.ErrDeadlineExceeded

Ian's CL seems worth trying. Let's just be ready to roll it back if there are problems.

Change https://golang.org/cl/239705 mentions this issue: net: consistently document deadline handling

Was this page helpful?
0 / 5 - 0 ratings