Constructing a client with clientv3.New() without cfg.DialTimeout set is a non-blocking call if the etcd server is not available.
https://github.com/coreos/etcd/blob/bb744f6d2b2715ec4aac696e0de51c75955334a8/clientv3/client.go#L427-L448
It is very counter-intuitive that setting DialTimeout makes New() a blocking call (and requires the server to be available).
Is there a reason that clientv3.New() must behave differently when DialTimeout is set? I would expect the dial timeout to take effect whenever dialing happens, but not to force dialing (and block until failure) in New(). There is also no mention of that behavior change in the documentation for that config option:
https://github.com/coreos/etcd/blob/bb744f6d2b2715ec4aac696e0de51c75955334a8/clientv3/config.go#L33-L34
We found this unexpected behavior in an application which did the following:
In this scenario, without DialTimeout set, things worked fine. With DialTimeout set, the application failed because the blocking New() client construction prevented the embedded server from starting.
Obviously, we can reorder to start the embedded server first, but making New() blocking still slows down startup, since we construct several clients that previously did their work in the background. Now they all block until their balancer is ready, which happens serially.
@liggitt
Is there a reason that clientv3.New() must behave differently when DialTimeout is set?
We wanted to simulate grpc.WithBlock in the etcd layer. The current behavior is: 1) with no dial timeout, block forever until the connection is up; and 2) with a dial timeout, wait up to a fixed amount of time until the connection is up.
There is no mention of that behavior change in the documentation for that config option as well:
We agree that this should be documented better. We are almost done with the client v3 balancer rewrite and will make sure to document this when it gets merged.
Current behavior is 1) no dial timeout to block forever until connection up, and 2) dial timeout to wait up to a fixed amount of time until connection up.
The current behavior of client.New is "no dial timeout means New doesn't block and connection attempt happens in the background"
"no dial timeout means New doesn't block and connection attempt happens in the background"
Yes, this is accurate. Will improve our docs. Thanks!
The initial issue described that New() with a DialTimeout led to a blocking call, while New() without a DialTimeout did not block.
I'm currently experiencing a problem where I do use a DialTimeout, but the behaviour is completely different from the description:
- The blocking doesn't happen in the New() call, but instead in the following cli.Status() call.
- The DialTimeout of 2 seconds doesn't seem to have any effect.

This is all while I have no etcd instance running. As soon as I start one, everything works fine. But I need to be able to handle the case that my etcd instance is down, preferably without having to manually use a goroutine and timer to do this.
Versions:
git log first entry in "src\go.etcd.io\etcd": Commit ae25c5e1320f731a2ffaafbf756aca8b0a94dfab
Code to reproduce:
```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	config := clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 2 * time.Second,
	}
	cli, err := clientv3.New(config)
	if err != nil {
		fmt.Println("connection error 1:", err)
		return // cli is nil here, so don't use it
	}
	defer cli.Close()
	fmt.Println("got the client")
	statusRes, err := cli.Status(context.Background(), "localhost:2379") // Waits here indefinitely
	if err != nil || statusRes == nil {
		fmt.Println("connection error 2:", err)
		return
	}
	fmt.Println("connection ok")
}
```
Note: I'm a Go beginner
In case others have the same problem, here's a workaround that calls the etcd client code in a goroutine, returns the client via a channel, and uses a timeout in case the etcd client code blocks forever:
```go
package main

import (
	"context"
	"errors"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := NewClient()
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	_, err = cli.Put(context.Background(), "foo", "bar")
	if err != nil {
		panic(err)
	}
}

// NewClient creates a new etcd client and takes care of a timeout.
func NewClient() (*clientv3.Client, error) {
	// The behaviour of New() seems to be inconsistent.
	// It should block at most for the specified time in DialTimeout.
	// In our case, though, New() doesn't block; the following call does instead.
	// Maybe it's just the specific version we're using.
	// See https://github.com/etcd-io/etcd/issues/9829#issuecomment-438434795.
	// Use our own timeout as a workaround.
	// TODO: Remove workaround after etcd behaviour has been fixed or clarified.
	config := clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 2 * time.Second,
	}
	// Buffered so the goroutine can always send its result and exit.
	errChan := make(chan error, 1)
	cliChan := make(chan *clientv3.Client, 1)
	go func() {
		cli, err := clientv3.New(config)
		if err != nil {
			errChan <- err
			return
		}
		statusRes, err := cli.Status(context.Background(), config.Endpoints[0])
		if err != nil {
			cli.Close() // don't leak the client on failure
			errChan <- err
			return
		}
		if statusRes == nil {
			cli.Close()
			errChan <- errors.New("the status response from etcd was nil")
			return
		}
		cliChan <- cli
	}()
	select {
	case err := <-errChan:
		return nil, err
	case cli := <-cliChan:
		return cli, nil
	case <-time.After(3 * time.Second):
		// If the goroutine finishes after the timeout, close the client it produced.
		go func() {
			select {
			case cli := <-cliChan:
				cli.Close()
			case <-errChan:
			}
		}()
		return nil, errors.New("a timeout occurred while trying to connect to the etcd server")
	}
}
```
Closing via https://github.com/etcd-io/etcd/blob/master/Documentation/upgrades/upgrade_3_4.md#require-grpcwithblock-for-client-dial and https://github.com/kubernetes/kubernetes/pull/81435. This is also highlighted in our changelog: https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md.
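Going by the linked upgrade note, a v3.4 client that should block in New() until the connection is up has to ask for that explicitly through the DialOptions field. Below is a config fragment, not a full program. It assumes the v3.4 import paths go.etcd.io/etcd/clientv3 and google.golang.org/grpc, and the endpoint and timeout values are placeholders.

```go
// Fragment: with etcd v3.4, DialTimeout alone does not make New() wait;
// adding grpc.WithBlock() makes clientv3.New block until the connection is up.
cfg := clientv3.Config{
	Endpoints:   []string{"localhost:2379"},
	DialTimeout: 2 * time.Second,
	DialOptions: []grpc.DialOption{grpc.WithBlock()},
}
cli, err := clientv3.New(cfg) // err is non-nil if no connection within DialTimeout
```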