Go-ipfs: Unreliable p2p streams via http proxy

Created on 26 Aug 2019 · 15 comments · Source: ipfs/go-ipfs

Version information:

go-ipfs version: 0.4.22-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.7

Description:

We rely heavily on p2p streams via the http proxy endpoint, and they seem to be less reliable than I would expect (compared to a wget to the same node). When the target ipfs node is a long-running node with a fixed public IP address, I would expect dials to that node to almost always work on a normal internet connection. The only custom config we have is:

config --json Experimental.Libp2pStreamMounting true
config --json Experimental.P2pHttpProxy true
config --json Experimental.PreferTLS true
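
A minimal end-to-end check of the proxy path looks roughly like this (a sketch: the peer ID is a placeholder, and the experimental flags above only take effect after a daemon restart):

ipfs shutdown                      # flags only apply once the daemon restarts
ipfs daemon &                      # or restart via your service manager
sleep 15                           # give it time to come up and bootstrap
curl -v "http://127.0.0.1:8080/p2p/<target-peer-id>/http/"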

In this case the source node is a local node. The reason for failure varies. For example:

2019/08/26 16:35:40 http: proxy error: context deadline exceeded

http: proxy error: failed to dial : all dials failed

  • [/ip6/2a01:7e01::f03c:91ff:fec1:550c/tcp/4001] dial tcp6 [2a01:7e01::f03c:91ff:fec1:550c]:4001: connect: network is unreachable
  • [/ip4/172.104.245.75/tcp/4001] failed to negotiate security stream multiplexer: read tcp4 10.0.2.15:4001->172.104.245.75:4001: read: connection reset by peer

2019/08/26 16:45:34 http: proxy error: failed to dial : all dials failed

  • [/ip6/2a01:7e01::f03c:91ff:fec1:71e8/tcp/4001] dial tcp6 [2a01:7e01::f03c:91ff:fec1:71e8]:4001: connect: network is unreachable
  • [/ip4/172.104.157.121/tcp/4001] failed to negotiate security stream multiplexer: read tcp4 10.0.2.15:4001->172.104.157.121:4001: read: connection reset by peer

2019/08/26 16:46:34 http: proxy error: max dial attempts exceeded

kind/bug topic/http-api


All 15 comments

This looks like a timeout. We have a minute-long accept timeout which matches this exactly. After the timeout, the server will reset the connection.

It _could_ be that the server is massively overloaded with new connections and is taking a while to perform the secio handshake. You might want to try building with GOFLAGS=-tags=openssl (requires openssl). Assuming you're using RSA keys, this is significantly faster.
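
Concretely, that build might look like this (a sketch, assuming a checkout of the go-ipfs source tree; the binary path is the Makefile's usual output location):

git clone https://github.com/ipfs/go-ipfs
cd go-ipfs
# requires the OpenSSL headers, e.g. libssl-dev on Debian/Ubuntu;
# openssl-backed crypto speeds up RSA operations such as the secio handshake
GOFLAGS=-tags=openssl make build
./cmd/ipfs/ipfs version            # build target typically leaves the binary here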

Is 172.104.157.121 advertising itself as a relay, by any chance?

I haven't changed the defaults. The swarm config is:

"Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 900,
      "LowWater": 600,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "DisableRelay": false,
    "EnableAutoNATService": false,
    "EnableAutoRelay": false,
    "EnableRelayHop": false
  }

The target node is a single-core server which sits at around 30% CPU usage without any Peergos-induced load. Its identity key is an RSA keypair.

Hm. Very interesting...

I've been trying to figure out why p2p streams seem to work fine for ipfs-cluster, which makes extensive use of them, but not for us. The main difference that comes to mind is that in ipfs-cluster all the nodes are swarm connected to each other and thus maintain connections (is that right @hsanjuan ?). In our case we often try to p2p dial very soon after a node has been started (and it bootstraps using the default bootstrap nodes). Could this be related to timeouts? Is the DHT lookup, when we have to fall back to it, simply too slow?

ipfs-cluster all the nodes are swarm connected to each other and thus maintain connections (is that right @hsanjuan ?)

Yes, mostly.
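
If maintained connections really are the difference, one simple experiment is to force a swarm connection to the target immediately before the p2p dial (a sketch; the multiaddr and peer ID are placeholders):

ipfs swarm connect /ip4/203.0.113.10/tcp/4001/ipfs/<target-peer-id>   # establish the libp2p connection first
curl "http://127.0.0.1:8080/p2p/<target-peer-id>/http/"               # the proxy dial can then reuse it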

Looking around I noticed a 30-second hardcoded timeout here, by the way: https://github.com/libp2p/go-libp2p-gostream/blob/master/conn.go#L83 which seems very arbitrary...

And the context is cancelled on return :/ I'm not sure this has anything to do with the issue (it seems that context is only used for dialing, and things would have been horribly broken before), but I'm fixing it...

Added this here: https://github.com/ipfs/go-ipfs/pull/6684. Any chance you can test?
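
For anyone wanting to try it, fetching and building that PR locally is roughly (a sketch; the PR number is the one linked above):

git clone https://github.com/ipfs/go-ipfs
cd go-ipfs
git fetch origin pull/6684/head:test-pr-6684   # GitHub exposes PRs as pull/<n>/head refs
git checkout test-pr-6684
make build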

Thanks for that @hsanjuan, I've tried out that branch. I test it by just browsing to
http://localhost:8080/p2p/QmVdFZgHnEgcedCS2G2ZNiEN59LuVrnRm7z3yXtEBv2XiF/http/
which should show a Peergos login page.

The first time after starting the daemon it failed with:

2019/09/27 17:11:26 http: proxy error: routing: not found

Then it worked the next 4 times quickly (including loading all the page assets).

Then on the 5th refresh it hung for 5 minutes with no error before I cancelled it.
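
A scripted version of that manual test makes the failure rate easier to measure (a sketch; the peer ID is a placeholder):

for i in $(seq 1 20); do
  # record HTTP status and total time per attempt; time out rather than hang
  curl -s -o /dev/null --max-time 30 \
       -w "attempt $i: %{http_code} in %{time_total}s\n" \
       "http://localhost:8080/p2p/<target-peer-id>/http/"
done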

This is now a serious problem for us. I've just restarted two ipfs instances (including the dial target above) in the same datacentre. Even after 20 dial attempts the second can't connect to the first. The first is still a single-core server, at 10% CPU utilisation at the moment.
Even a direct swarm connect command is failing, so it's nothing to do with the http api:

swarm connect /ip4/172.104.157.121/tcp/5001/ipfs/QmVdFZgHnEgcedCS2G2ZNiEN59LuVrnRm7z3yXtEBv2XiF
err: Error: connect QmVdFZgHnEgcedCS2G2ZNiEN59LuVrnRm7z3yXtEBv2XiF failure: failed to dial : all dials failed

I can only think of enabling swarm2 debugging and trying to figure out where it is trying to dial... it is probably choosing the wrong address (if any)...
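
Turning that on is straightforward (a sketch; available subsystem names can be listed with ipfs log ls):

ipfs log level swarm2 debug   # raise verbosity for the dialing subsystem
ipfs log tail                 # stream the daemon's event log while reproducing the failed dial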

The problems are normally after a restart. With a sufficiently long wait they are much more stable. Would it help to include our own nodes as bootstrap nodes?

The problems are normally after a restart. With a sufficiently long wait they are much more stable. Would it help to include our own nodes as bootstrap nodes?

@ianopolous without actually figuring out what dials were tried, why and why each failed it is hard to say anything here.
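
For completeness, if adding your own long-running nodes to the bootstrap list does get tried, it would look roughly like this (a sketch; the multiaddr and peer ID are placeholders):

ipfs bootstrap add /ip4/203.0.113.10/tcp/4001/ipfs/<target-peer-id>
ipfs bootstrap list   # verify the entry was added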

I think that go-ipfs v0.4.23 has fixed this issue, which would mean the issue was triggered by us using TLS instead of secio. I'll do more testing before closing though.
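
One way to narrow that down is to compare runs with the TLS preference switched back off (the same flag as in the config above) and see whether the failures return:

ipfs config --json Experimental.PreferTLS false   # fall back to the default secio-first negotiation
ipfs shutdown                                     # restart the daemon so the change takes effect
ipfs daemon &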

That would definitely explain this issue.

I haven't managed to get it to fail once now, whereas previously it failed about 1 in 3 times under the same tests. Thank you so much!

