Xud: achilles heel

Created on 9 Oct 2019  Â·  14Comments  Â·  Source: ExchangeUnion/xud

https://github.com/raiden-network/raiden/pull/5025 was merged to Raiden.

IMHO, this open the door for a "not atomic swap".

consider raiden as first leg and lnd as second leg. Alice and Bob agreed on a swap and swap started. Alice issue first leg to Bob, Bob's xud called from raiden and issued the second leg. Now, there was some kind of a problem on LND and bob's resolver returned HTTP 500 to raiden.

Outcome - alice got paid (second leg). Bob got the secret from Alice. Bob didn't get his payment from Alice. resolving the payment on-chain by Bob is very hard and practically impossible.

I haven't done full analysis of xud to check if there is a scenario that can cause it. Even if it is hard to find one now, we can't be sure that we will not add one in the future.

The resolution IMHO is that the resolver will only send back HTTP codes 200 (with secret) and 404 (not found). We can consider other codes if these represents situations that we understand. But 500 and other generic codes should never be used.

P1 security

All 14 comments

Raiden, when getting any code which it does not clearly understand should retry. If it gets 500 it must retry and should not assume that the maker does not have the secret.

If the request is not valid (400) raiden should assert(False) and crash.

Nice catch @offerm - I haven't tested this manually, but if I'm reading https://github.com/raiden-network/raiden/pull/5025/files correctly then Raiden will stop querying for the preimage from xud after 500 response code. This should not happen.

We can talk about this on the call today again but I am quite certain this is not a problem. I put my thoughts in that PR (https://github.com/raiden-network/raiden/pull/5025#discussion_r330642231) and also talked this over with @kilrau.

Currently xud only returns 500 if the swap failed and our payment did not go through for one reason or another. If there is a problem with LND such that payment goes through but LND responds to us as if it didn't, that is a problem regardless (we would treat that as a 404 if I'm understanding you correctly.

My main reasoning behind that PR that was merged in raiden was to use HTTP status codes as they were intended. 4xx status codes indicate a problem with the request - however a resolve request can fail even with a perfectly valid request.

We can all agree that xud should be sure not to respond with a 500 status code (or any error status code) if there's a possibility that our payment may go through and we may learn the preimage/secret eventually. That is the important part here, and it's not related to specific status codes. Basically there are 4 outcomes to a call.

1) Success (200 status code) with the preimage
2) Failure, permanent e.g. unknown hash, swap failed (any code besides 200 or 503)
3) Failure, but the resolve request should be retried (503 status code)
4) No response/connection refused, resolve request is retried

1 & 4 seem straightforward. xud currently does not respond in case 3 (but we could if we wanted to with 503 SERVICE UNAVAILABLE). xud responds in case 2 in case of a swap failure only when we get word that our outgoing payment has failed. We can double check that logic behaves as intended, but it's separate from the specific status code used.

The alternative of using only status code 404 for case 2 above doesn't make a meaningful difference if we are still returning it when we get word that our outgoing payment has failed. It just limits the status codes we can use to inform the client as to why the call failed.

One more point to consider: if raiden with a resolver configured does a regular raiden payment, it will ask the resolver for the preimage, resolver returns 404 and raiden continues and asks initiator for preimage.

I am thinking that if we want to definitively inform raiden that a resolver request permanently failed, perhaps we should not rely on HTTP status codes for that at all. They are, by design, somewhat broad and not exactly intended for our use case in the resolver (which technically need not use HTTP at all, it would work just as well with TCP).

Perhaps what we should do then is return a JSON body in the error response with a retry property that is either true or false. If it is true - or if raiden does not receive a valid response - then raiden retries. If it is false then raiden will stop retrying.

Perhaps what we should do then is return a JSON body in the error response with a retry property that is either true or false. If it is true - or if raiden does not receive a valid response - then raiden retries. If it is false then raiden will stop retrying.

That actually sounds like a good compromise. We can give detailed error info in the JSON body and clearly indicate if to retry or not while avoiding anyone else "accidentally" returning a 500 code and make raiden stop retrying.

What do you think? @offerm @erkarl

I'm good with the proposed solution from @sangaman.

Thinking more about it:

a. providing retry property in a json object will not take away the need to handle all kinds of HTTP errors. The raiden sind, when getting an HTTP code that it does not know how to handle must retry.

b. The outcome of a raiden call to the resolver can only be:
i) the preimage - swap continues
ii) an response saying that the resolver will never have the preimage - payment continues
iii) invalid request (format error or something like that) - raiden should crash
Any thing else must convert to a retry.

Doing this with json just complicates things since we need to check both HTTP and the json itself.

My suggestion is to go back to the 200/404 we had before.

Since we have only 2 outcomes

ii) an response saying that the resolver will never have the preimage - payment continues

This would be the case where it responds with "retry": false. In any other case, raiden should retry the call, regardless of http status code. I think that is clearer and more explicit than relying on a particular status code.

ii) an response saying that the resolver will never have the preimage - payment continues

This would be the case where it responds with "retry": false. In any other case, raiden should retry the call, regardless of http status code. I think that is clearer and more explicit than relying on a particular status code.

The way I understand it if raiden receives any other code except 200 OK it should continue to hitting the resolver endpoint for the preimage until it receives "retry: false". This approach seems unambiguous to me.

This sound good to me.

So, raiden should issue the HTTP request and use a logic like this:

if result is error(exception of any kind) or HTTP code other than 200 it should retry.

If result is HTTP code 200, it should check:
a. does the preimage included in the response -> continue the swap
b. if not, does retry has the value of false -> if so, continue the raiden flow.
c. else retry.

If the json can't be parsed -> retry
if the json don't have the retry attribute or it has an invalid value -> retry
if the preimage can't be hashed to the secret -> crash/retry

@erkarl is this how you see it?

Note that while the above logic is sound, it is over complicated with not real need. We can just use 2 HTTP codes that signal - OK, NOT FOUND, ERROR.
We need to keep in mind that this interface is not just for XUD but is more generic and may be used by others.

Another comment:

  • when we provide retry=true we must be sure that the payment was not done or failed. The risk here is of double payment and I'm not sure that there is a safeguard in place to protect from it.

a. does the preimage included in the response -> continue the swap
b. if not, does retry has the value of false -> if so, continue the raiden flow.
c. else retry.

That's right.

Note that while the above logic is sound, it is over complicated with not real need. We can just use 2 HTTP codes that signal - OK, NOT FOUND, ERROR.
We need to keep in mind that this interface is not just for XUD but is more generic and may be used by others.

Yes this is my concern as well, but that's also why I don't want to use a client error status code for errors that don't pertain to the client. "Not found" is not intuitive to me as necessarily a permanent failure in the resolver context (it could be that we can't find it now, but we might find it later). It's also worth mentioning that 404 will be the status code if the wrong path is used against xud, or if the right path is used against the wrong server many servers will respond with 404 as well, and we wouldn't want to fall back to the raiden swap in those cases.

when we provide retry=true we must be sure that the payment was not done or failed. The risk here is of double payment and I'm not sure that there is a safeguard in place to protect from it.

I tried to do exactly this in #1297 with separate internal errors for "final" payment errors and "unknown" payment errors.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

kilrau picture kilrau  Â·  6Comments

kilrau picture kilrau  Â·  4Comments

moshababo picture moshababo  Â·  5Comments

moshababo picture moshababo  Â·  7Comments

kilrau picture kilrau  Â·  5Comments