hi,
i'm using etcd 3.0.15 behind a load balancer.
when an etcd node goes down, the clients get moved to a different node. after that, all requests from the moved clients fail and the etcd log is spammed with "invalid auth token: gapUm...". the only solution i've found so far is to restart the client.
this happens, for example, when i do a rolling update of the etcd servers.
the client is using c31b1ab8d18ff7990a43bd258ca54f80c5a3c952 at the moment.
Hi @ekle , thanks a lot for your report. I found a problem in the client-side logic, so I will fix it tomorrow or next week.
The detail of the problem is like below (assume s1, s2, and s3 are servers, and c is a client): c issues an Authenticate() RPC and a token is generated; all of s1, s2, and s3 share it as a valid token.

To avoid the problem, clients need to issue the Authenticate() RPC again when they see the invalid token error. For example, clientv3.kv.Do() would look like below:
```go
func (kv *kv) Do(ctx context.Context, op Op) (OpResponse, error) {
	nrAuthRetry := 0
	for {
		resp, err := kv.do(ctx, op)
		if err == nil {
			return resp, nil
		}
		if isHaltErr(ctx, err) {
			return resp, toErr(ctx, err)
		} else if isAuthErr(ctx, err) {
			if nrAuthRetry != 0 {
				// auth info was updated on the server side
				return resp, toErr(ctx, err)
			}
			// reauth here and refresh the token
			nrAuthRetry++
			continue
		}
		// do not retry on modifications
		if op.isWrite() {
			return resp, toErr(ctx, err)
		}
	}
}
```
The change won't be difficult technically, but the number of changed lines will be large. So I'd like to reach a consensus on the direction (adding new code for handling the error in each RPC caller of clientv3). Is this ok? @xiang90 @heyitsanthony @gyuho Anyway, this is required for handling the TTL of tokens (related to https://github.com/coreos/etcd/pull/6574).
@ekle I noticed that the mechanism should be implemented on top of the existing retry mechanism. It will take a while; sorry for keeping you waiting.
I'm digging into the problem and it is more complicated than I thought. The invalid token error can be caught here: https://github.com/coreos/etcd/blob/master/clientv3/retry.go#L36, but current gRPC doesn't seem to provide a way to reconfigure the PerRPCCredentials of a connection. So we don't have a way to install a fresh token during the error handling. I'll look for a workaround for a while; if I cannot find one, I'll talk to the gRPC team about a new API for this purpose.
@ekle I found a reasonable workaround and created a PR here: https://github.com/coreos/etcd/pull/7110. Could you try it?
i can try it on monday
@mitake i tried it with 1ea73ec0c01ffecc9934563ad37f6d61cb37c39a and i still have the error invalid auth token.
do i need to enable this somehow?
@ekle thanks for your testing and sorry I botched the PR during my cleanup... could you try it again?
BTW, did you update both your client and your server? Did the server's log show the same "invalid auth token..." message?
only the client. i was not aware that i had to update the server too.
yes the message was the same.
I see. Then could you update both of your components with the latest version of the PR?
ok, updated the server and the client to c803baaf9b2b783f5e03e67fc9bb42d4fa730964. still not working
Thanks for your testing again. Is it possible to share your client code? If so, I'd like to know which RPCs your client uses. That information will be useful for reproducing the problem on my side.
sorry, i cannot share the code.
i use only the KV RPCs with leases and this transaction:

```go
cmp = v3.Compare(v3.Value(key), "=", value)
put = v3.OpPut(key, value, v3.WithLease(leaseId))
get = v3.OpGet(key)
resp, err = ETCD.Txn(ctx).If(cmp).Then(put).Else(get).Commit()
```
i tried to reproduce the problem with just one etcd instance by restarting it, but i couldn't trigger it.
Thanks for sharing. I noticed that the PR doesn't add a retry mechanism to lease and some other RPCs... I'll update it for that purpose tomorrow.
@ekle could you try the latest https://github.com/coreos/etcd/pull/7110 ? It resolved problems related to lease RPCs so I hope it works fine for your client.
the good thing first: i don't have to restart the client anymore to recover.
the bad thing: when i do a rolling restart of the etcd cluster, sometimes the nodes lose their auth-enabled status, leading to invalid auth token: gapUm... again. don't know if this is related.
@ekle thanks for testing. And the latter problem would be a different topic from this PR (I could also reproduce the problem). I'll open another PR for it later.
hi, I encountered an issue when the client calls func (c *Client) getToken(ctx context.Context) error:
although it loops through all the endpoints, once the ctx exceeds its deadline on the first endpoint, the calls to all the following endpoints fail too.
so if the first endpoint has an issue and times out, the client creation can never succeed.
@mitake
@rayzyar thanks for reporting. Handling this situation is a little bit complex because we need to determine how much time we can spend on each Authenticate() RPC to the endpoints. I created an easy fix here: https://github.com/mitake/etcd/tree/get-token-timeout. Could you try it?