hi,
i'm using etcd 3.0.15 behind a load balancer.
when an etcd node goes down, the clients get moved to a different node. after that, all requests from the moved clients fail and the etcd log is spammed with "invalid auth token: gapUm...". the only solution i've found so far is to restart the client.
this happens, for example, when i do a rolling update of the etcd servers.
the client is using c31b1ab8d18ff7990a43bd258ca54f80c5a3c952 at the moment.
Hi @ekle , thanks a lot for your report. I found a problem in the client-side logic, so I will fix it tomorrow or next week.
The detail of the problem is like below (assume s1, s2, and s3 are servers, and c is a client): c issues an Authenticate() RPC and a token is generated; all of s1, s2, and s3 share it as a valid token.

To avoid the problem, clients need to issue the Authenticate() RPC again when they see the invalid token error. For example, clientv3.kv.Do() would look like below:
```go
func (kv *kv) Do(ctx context.Context, op Op) (OpResponse, error) {
	nrAuthRetry := 0
	for {
		resp, err := kv.do(ctx, op)
		if err == nil {
			return resp, nil
		}
		if isHaltErr(ctx, err) {
			return resp, toErr(ctx, err)
		} else if isAuthErr(ctx, err) {
			if nrAuthRetry != 0 {
				// auth info was updated on the server side
				return resp, toErr(ctx, err)
			}
			// reauth here and refresh the token
			nrAuthRetry++
			continue
		}
		// do not retry on modifications
		if op.isWrite() {
			return resp, toErr(ctx, err)
		}
	}
}
```
The change won't be difficult technically, but the number of changed lines will be large. So I'd like to reach a consensus on the direction (adding new code for handling the error in each RPC caller of clientv3). Is this ok? @xiang90 @heyitsanthony @gyuho Anyway, this is required for handling the TTL of tokens (related to https://github.com/coreos/etcd/pull/6574).
@ekle I noticed that the mechanism should be implemented on top of the existing retry mechanism. It will take a while; sorry for keeping you waiting.
I'm digging into the problem and it is more complicated than I thought. The invalid token error can be caught here: https://github.com/coreos/etcd/blob/master/clientv3/retry.go#L36, but current gRPC doesn't seem to provide a way to reconfigure the PerRPCCredentials of a connection. So we don't have a way to install a fresh token during the error handling. I'll look for a workaround for a while; if I cannot find one, I'll talk to the gRPC team about a new API for this purpose.
@ekle I found a reasonable workaround and created a PR here: https://github.com/coreos/etcd/pull/7110. Could you try it?
i can try it on monday
@mitake i tried it with 1ea73ec0c01ffecc9934563ad37f6d61cb37c39a and i still have the error invalid auth token.
do i need to enable this somehow?
@ekle thanks for your testing and sorry I botched the PR during my cleanup... could you try it again?
BTW, did you update both your client and your server? Did the server's log show the same "invalid auth token..." message?
only the client. i was not aware that i had to update the server too.
yes the message was the same.
I see. Then could you update both of your components with the latest version of the PR?
ok, updated the server and the client to c803baaf9b2b783f5e03e67fc9bb42d4fa730964. still not working
Thanks for your testing again. Is it possible to share your client code? If so, I'd like to know which RPCs your client uses. That information will be useful for reproducing the problem on my side.
sorry, i cannot share the code.
i use only the KV RPCs with leases and this transaction:

```go
cmp = v3.Compare(v3.Value(key), "=", value)
put = v3.OpPut(key, value, v3.WithLease(leaseId))
get = v3.OpGet(key)
resp, err = ETCD.Txn(ctx).If(cmp).Then(put).Else(get).Commit()
```
i tried to reproduce the problem with just one etcd instance by restarting it, but i couldn't trigger it.
Thanks for sharing. I noticed that the PR doesn't add a retry mechanism to lease and some other RPCs... I'll update it for that purpose tomorrow.
@ekle could you try the latest https://github.com/coreos/etcd/pull/7110 ? It resolved problems related to lease RPCs so I hope it works fine for your client.
the good thing first: i don't have to restart the client anymore to recover.
the bad thing: when i do a rolling restart of the etcd cluster, sometimes the nodes lose their auth-enabled status, leading to invalid auth token: gapUm... again. don't know if this is related.
@ekle thanks for testing. And the latter problem would be a different topic from this PR (I could also reproduce the problem). I'll open another PR for it later.
hi, I encountered an issue when the client calls func (c *Client) getToken(ctx context.Context) error:
although it loops through all the endpoints, once the ctx exceeds its deadline on the first endpoint, the calls to all the following endpoints fail too.
so if the first endpoint has an issue and times out, the client creation can never succeed.
@mitake
@rayzyar thanks for reporting. Handling this situation is a little bit complex because we need to determine how much time we can spend on each Authenticate() RPC to the endpoints. I created an easy fix here: https://github.com/mitake/etcd/tree/get-token-timeout. Could you try it?