Go-ipfs: Relay Infrastructure Integration

Created on 1 May 2018 · 21 Comments · Source: ipfs/go-ipfs

Over the last few weeks (months) we have witnessed deteriorating DHT performance.

We have reached a tipping point: Diagnosing the network with a DHT crawler has revealed that 61% of the DHT is undialable.

The way forward, in the short/medium term, is to deploy relay capacity and integrate a set of known relays into the go-ipfs distribution itself.
This has the added benefit that it will greatly improve connectivity with the browser world, so it's not a shallow fix.

This issue is here to discuss and track progress towards relay infrastructure integration.

All 21 comments

cc @lgierth @diasdavid @Stebalien @dryajov @mgoelzer

cc myself

The way forward, in the short/medium term, is to deploy relay capacity and integrate a set of known relays into the go-ipfs distribution itself.

How would this be done? I always thought we would have our bootstrap nodes be relays as well; it seems the most sensible way of doing it, unless I'm missing something.

#4992 will also be needed

I always thought we would have our bootstrap nodes be relays as well; it seems the most sensible way of doing it, unless I'm missing something.

There is the issue of separation of concerns and scalability.

We want relays to be dedicated nodes with fast network connections and no random connections/bitswapping. This will allow the relay nodes to keep their (passive) connections open for long periods of time. We also want them to be accessible over a multitude of transports (e.g. wss for browser nodes).

And we really don't want to kill the bootstrap nodes by overloading them with relay duties; imagine what will happen if 1M browser nodes hit them.
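
For reference, a relayed connection is expressed as a circuit multiaddr: the relay's own address, then /p2p-circuit, then the destination peer. A tiny Go sketch composing one such address (both peer IDs below are placeholders, not real nodes):

package main

import "fmt"

func main() {
    // Placeholders; a real address would carry valid base58 peer IDs.
    relay := "/ip4/203.0.113.10/tcp/4001/ipfs/QmRelay..."
    target := "QmTarget..."

    // <relay address>/p2p-circuit/ipfs/<destination peer>
    circuit := fmt.Sprintf("%s/p2p-circuit/ipfs/%s", relay, target)
    fmt.Println(circuit)
    // A NATed or browser node would advertise addresses of this shape
    // once it is reachable through the relay.
}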

Also related: https://github.com/libp2p/go-libp2p-swarm/pull/57

This will allow us to start advertising relay addresses without interfering with direct dials.

https://github.com/ipfs/go-ipfs/issues/4993 is very much an essential component.

@vyzo we need a way not to depend on go-ipfs being shipped with the right set of relays. It is becoming a problem with bootstrap nodes right now. We will need a layer of indirection for this.

@Kubuxu agreed

We've been working on rendezvous as the long-term solution:

Considerable thought in the design went towards engineering a long-term solution for scalable relay infrastructure and open relay capacity pools, but we are not there yet!

The fixed relays are intended as the bootstrap step towards a fully open relay infrastructure for the ipfs network.
In the short/medium term, we will use them to deploy the initial relay capacity. This will allow us to stress-test the protocols and infrastructure at the current core network scale (~2.5K-4.5K nodes).
It will also give us insights and data to better engineer the long term network structure.

The way to add this, I think, would be to make it an implicit default: if the values are not set in the config file, use the default ones. This allows us to easily change the values across versions of ipfs, while still retaining the ability for users to specify their own relays.
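
A rough sketch of what that fallback could look like (the field and variable names here are made up for illustration, not the actual go-ipfs config keys):

package main

import "fmt"

// defaultRelays is a hypothetical compiled-in list; the real addresses would
// ship with the go-ipfs release and be updated across versions.
var defaultRelays = []string{
    "/dns4/relay0.example.net/tcp/4001/ipfs/QmRelay0...",
    "/dns4/relay1.example.net/tcp/4001/ipfs/QmRelay1...",
}

// relaysFor returns the user-configured relays when present and falls back
// to the compiled-in defaults otherwise.
func relaysFor(configured []string) []string {
    if len(configured) > 0 {
        return configured
    }
    return defaultRelays
}

func main() {
    fmt.Println(relaysFor(nil))                                // implicit defaults
    fmt.Println(relaysFor([]string{"/ip4/10.0.0.1/tcp/4001"})) // user override wins
}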

Curious if anything further has happened with this? I threw up a few nodes after Cloudflare made their announcement and I'm quite surprised at some of the traffic. There are a handful of hosts getting hit with several thousand connections per second. The connections themselves were very short, sometimes not even a single data packet. My hosts are on 10G, but I have to wonder about whoever isn't so fortunate. The nodes settled quite nicely after blocking.

We've added a bunch of DHT nodes.

There are a handful of hosts getting hit with several thousand connections per second. The connections themselves were very short, sometimes not even a single data packet.

Those may be parallel dials to different listeners on the same node. We currently dial in parallel as some transports/addresses work in some situations and others don't.

However, I'd usually expect at least one data packet in most cases.

Indeed there are some nodes with many listeners, but I've yet to find one with a thousand :) To be clear, in the case with no data packets I simply see SYN, SYN/ACK, FIN.

Perhaps the wrong place to ask (I'll edit this if so), but possibly related: there are a lot of 0-byte packets on established connections. Not entirely sure why Go is defaulting to disabling Nagle? It results in tiny packets.

edit: I take that back, there are several. A couple are 10k+ (seriously?). One of them in China is advertising 15 _thousand_ ports and is indeed blocked. None of the ports tested were open.

Not entirely sure why Go is defaulting to disabling Nagle? It results in tiny packets.

Ideally, we'd handle this in userland.

edit: I take that back, there are several. A couple are 10k+ (seriously?). One of them in China is advertising 15 thousand ports and is indeed blocked. None of the ports tested were open.

Related: https://github.com/libp2p/libp2p/issues/47.

Is that a single peer or a single IP? If it's a single IP but multiple peers, it's probably a large NAT. Otherwise, it may be due to a bug we had where long-lived nodes never cleaned up old "observed" addresses (that likely never worked in the first place).


However, that wouldn't explain your inbound connections (unless your node is also advertising a ton of addresses).

Not entirely sure why Go is defaulting to disabling Nagle? It results in tiny packets.

Ideally, we'd handle this in userland.

Is that something already being worked on? I was looking at how to set socket options in Go as a quick hack, but couldn't find anything.

edit: I take that back, there are several. A couple are 10k+ (seriously?). One of them in China is advertising 15 thousand ports and is indeed blocked. None of the ports tested were open.

Related: libp2p/libp2p#47.

Is that a single _peer_ or a single IP? If it's a single IP but multiple peers, it's probably a large NAT. Otherwise, it may be due to a bug we had where long-lived nodes never cleaned up old "observed" addresses (that likely never worked in the first place).

Both; it was a single peer with one IP.

However, that wouldn't explain your _inbound_ connections (unless your node is _also_ advertising a ton of addresses).

Just the public IP address is being advertised. Kinda surprised there are no default filters for private addresses, but I added them. Though they are blocked / unroutable at the edge, IPFS keeps trying relentlessly. It cut back on a lot of crap trying to go out too.

$ ipfs swarm filters
/ip4/172.16.0.0/ipcidr/12
/ip4/169.254.0.0/ipcidr/16
/ip4/100.64.0.0/ipcidr/10
/ip4/192.168.0.0/ipcidr/16
/ip4/10.0.0.0/ipcidr/8
/ip4/0.0.0.0/ipcidr/32
$

Is that something already being worked on? I was looking at how to set socket options in Go as a quick hack, but couldn't find anything.

We could enable it, but that's not necessarily the correct choice. In fact, we create our own sockets (we needed to set an option we otherwise couldn't), so we mimicked Go's choice here.

When I said "in userspace", I meant "we should buffer up small writes before passing them to the kernel":

  1. Syscalls are expensive.
  2. We have our own "packet framing" in userspace. For example, our transport security protocol needs to MAC each write.

See: https://github.com/ipfs/go-ipfs/issues/4280
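
As a rough illustration of that kind of userspace buffering (not the actual go-ipfs write path), wrapping a connection in a bufio.Writer turns many small writes into a single syscall:

package main

import (
    "bufio"
    "fmt"
    "net"
)

func main() {
    // example.com:80 is just a stand-in endpoint for the sketch.
    conn, err := net.Dial("tcp", "example.com:80")
    if err != nil {
        fmt.Println("dial failed:", err)
        return
    }
    defer conn.Close()

    // Without buffering, each Write below is its own syscall (and, with
    // TCP_NODELAY, potentially its own tiny packet on the wire).
    w := bufio.NewWriterSize(conn, 4096)
    for i := 0; i < 100; i++ {
        fmt.Fprintf(w, "small write %d\n", i)
    }
    // One flush hands the whole buffer to the kernel in a single Write.
    if err := w.Flush(); err != nil {
        fmt.Println("flush failed:", err)
    }
}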

Both; it was a single peer with one IP.

Sounds like that bug I was talking about.

Kinda surprised there are no default filters for private addresses but I added them.

You can enable the server profile (ipfs config profile apply server). These don't exist because we want to be able to dial nodes on the local network. However, we do need to get smarter about when to advertise and when to dial these nodes.

Is that something already being worked on? I was looking at how to set socket options in Go as a quick hack, but couldn't find anything.

We could enable it, but that's not necessarily the correct choice. In fact, we create our own sockets (we needed to set an option we otherwise couldn't), so we mimicked Go's choice here.

Is that something I can add even as a test?

When I said "in userspace", I meant "we should buffer up small writes before passing them to the kernel":

  1. Syscalls are expensive.
  2. We have our own "packet framing" in userspace. For example, our transport security protocol needs to MAC each write.

See: #4280

Not quite following. You would still do the MAC in userspace. A problem with doing a push on everything is that you can negate the kernel's ability to buffer properly under high load. A simple test is to block a peer with an open connection. It will completely misbehave to the point that it times out entirely too early (~10 seconds?). One interesting bug with that is that something continues sending data on the socket despite having sent a FIN.

Both; it was a single peer with one IP.

Sounds like that bug I was talking about.

Is there any way to block peers that haven't updated?

Kinda surprised there are no default filters for private addresses but I added them.

You can enable the server profile (ipfs config profile apply server). These don't exist because we _want_ to be able to dial nodes on the local network. However, we do need to get smarter about when to advertise and when to dial these nodes.

Did not know about that! I'll likely test it if I can find out what all it does.

Is that something I can add even as a test?

Sure. See the calls to setNoDelay in https://github.com/libp2p/go-reuseport. You can then run the following in the go-ipfs source-code root to rebuild with your modified reuseport:

make deps
PATH="$(pwd)/bin:$PATH" gx-go link "$(gx deps -r | awk '($1=="go-reuseport") { print($2); }')"
make build

(we're hoping to kill this module off in the near future).
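
As an aside: if you just want to flip the option on a plain TCP connection for a quick test, the standard library exposes it directly on net.TCPConn (this sketch is outside the go-reuseport path, so it won't change what ipfs itself does):

package main

import (
    "fmt"
    "net"
)

func main() {
    // example.com:80 is just a stand-in endpoint for the sketch.
    conn, err := net.Dial("tcp", "example.com:80")
    if err != nil {
        fmt.Println("dial failed:", err)
        return
    }
    defer conn.Close()

    // Go enables TCP_NODELAY by default; SetNoDelay(false) re-enables
    // Nagle's algorithm on this particular connection.
    if tc, ok := conn.(*net.TCPConn); ok {
        if err := tc.SetNoDelay(false); err != nil {
            fmt.Println("SetNoDelay failed:", err)
        }
    }
}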

Not quite following. You would still do the MAC in userspace.

My point is that, by doing our buffering in userspace, we can send less data and make fewer syscalls. For example, if we buffer before our security transport, we can MAC larger chunks of data, amortizing the size of the MAC over those larger chunks. If we buffer before our stream multiplexer, we can pack multiple writes into a single stream data frame, saving the framing data. However, we obviously don't do enough buffering, so it may make sense to turn on Nagle's algorithm (but we'd have to test it).

The downside of Nagle's algorithm is that it can add latency. If we (in userspace) know that we're unlikely to send any additional data, we'd like the kernel to just send the packet immediately rather than waiting. Unfortunately, we do have some protocols that like to send one small latency-sensitive packet periodically.

Really, we just need to test more.
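
Back-of-the-envelope arithmetic for the amortization point, using made-up framing numbers (a 4-byte length prefix plus a 32-byte MAC per write; not the exact secio frame layout):

package main

import "fmt"

// frameOverhead is an illustrative per-write cost: a 4-byte length prefix
// plus a 32-byte MAC tag.
const frameOverhead = 4 + 32

// bytesOnWire returns the payload plus framing overhead for a given number
// of writes carrying that payload.
func bytesOnWire(payload, writes int) int {
    return payload + writes*frameOverhead
}

func main() {
    // 1000 bytes of payload, sent as 100 tiny writes vs one buffered write.
    fmt.Println("100 writes:", bytesOnWire(1000, 100), "bytes on the wire") // 4600
    fmt.Println("  1 write: ", bytesOnWire(1000, 1), "bytes on the wire")   // 1036
}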

A simple test is to block a peer with an open connection. It will completely misbehave to the point it times out entirely too early (~10 seconds?).

Not sure if I understand. What are you blocking and what times out?

One interesting bug with that is something continues sending data on the socket despite having sent a FIN.

From remote peers? That sounds like network reordering.

Is there any way to block peers that haven't updated?

In theory but not easily. However, you can follow https://github.com/libp2p/libp2p/issues/47. Basically, we're just going to start trimming these massive lists down.


(sorry for entirely derailing this issue, this is just a good conversation to have and I don't want to stall it)

Is that something I can add even as a test?

Sure. See the calls to setNoDelay in https://github.com/libp2p/go-reuseport. You can then run the following in the go-ipfs source-code root to rebuild with your modified reuseport:

make deps
PATH="$(pwd)/bin:$PATH" gx-go link "$(gx deps -r | awk '($1=="go-reuseport") { print($2); }')"
make build

(we're hoping to kill this module off in the near future).

Reuseport is a different thing though, no? I can't find any reference to setNoDelay, though that could 100% be me not understanding Go :)

Not quite following. You would still do the MAC in userspace.

My point is that, by doing our buffering in userspace, we can send less data and make fewer syscalls. For example, if we buffer _before_ our security transport, we can MAC larger chunks of data, amortizing the size of the MAC over those larger chunks. If we buffer before our stream multiplexer, we can pack multiple writes into a single stream data frame, saving the framing data. However, we obviously don't do _enough_ buffering, so it may make sense to turn on Nagle's algorithm (but we'd have to test it).

The downside of Nagle's algorithm is that it can add latency. If we (in userspace) know that we're unlikely to send any additional data, we'd like the kernel to just send the packet immediately rather than waiting. Unfortunately, we do have some protocols that like to send one small latency-sensitive packet periodically.

Indeed, the curse of any VPN and SSH. Such small packets, however, allow one to infer the type of data (timing attacks, for example). You can amortize the latency and gain more efficiency by bursting. TCP_CORK defaults to a maximum 200ms window, certainly noticeable for time-sensitive applications like SSH, but TCP itself purrs along nicely.

Really, we just need to test more.

A simple test is to block a peer with an open connection. It will completely misbehave to the point it times out entirely too early (~10 seconds?).

Not sure if I understand. What are you blocking and what times out?

Simulating failures by blocking an established connection using iptables, then watching how it responds with tcpdump. A properly functioning connection will detect the condition and back off (congestion avoidance) until either its send/receive queue pops or a protocol-specific watchdog kicks in (i.e. keepalive). What happens with ipfs (or Go itself?) is that it gives up.

For example:

IP ext01.1050 > ext02.1909: Flags [P.], seq 240:288, ack 241, win 1444, options [nop,nop,TS val 863879468 ecr 2811783365], length 48
IP ext01.1050 > ext02.1909: Flags [P.], seq 240:288, ack 241, win 1444, options [nop,nop,TS val 863879616 ecr 2811783365], length 48
IP ext01.1050 > ext02.1909: Flags [P.], seq 240:288, ack 241, win 1444, options [nop,nop,TS val 863879776 ecr 2811783365], length 48
IP ext01.1050 > ext02.1909: Flags [P.], seq 240:288, ack 241, win 1444, options [nop,nop,TS val 863880088 ecr 2811783365], length 48
IP ext01.1050 > ext02.1909: Flags [P.], seq 240:288, ack 241, win 1444, options [nop,nop,TS val 863880768 ecr 2811783365], length 48
IP ext01.1050 > ext02.1909: Flags [F.], seq 288, ack 241, win 1444, options [nop,nop,TS val 863881968 ecr 2811783365], length 0
IP ext01.1050 > ext02.1909: Flags [FP.], seq 240:288, ack 241, win 1444, options [nop,nop,TS val 863882048 ecr 2811783365], length 48
IP ext01.1050 > ext02.1909: Flags [FP.], seq 240:288, ack 241, win 1444, options [nop,nop,TS val 863884544 ecr 2811783365], length 48

IPs obfuscated; ext01 (remote) was blocked egress from ext02 (local).

You can artificially cause this with tc and netem (which would let you see the effect of various buffer sizes as a bonus). You'll likely need to increase the OS TCP buffers from their default (4096 is way too tiny).

One interesting bug with that is something continues sending data on the socket despite having sent a FIN.

From remote peers? That sounds like network reordering.

Could be. Doesn't appear to happen with all peers.

Is there any way to block peers that haven't updated?

In theory but not easily. However, you can follow libp2p/libp2p#47. Basically, we're just going to start trimming these massive lists down.

It's a bit of a slippery slope because on one hand, by doing so you're determining abuse, but you also have very little detail as to what is abuse versus, say, a misconfigured host. I could fully see someone spawning one process per CPU without understanding that there are other technologies already handling that (Go directly, for example), or simply forgetting to add the appropriate SNAT rules.

(sorry for entirely derailing this issue, this is just a good conversation to have and I don't want to stall it)

Honestly, I didn't know where else to bring this up either, with all the projects, lists, IRC, etc. It's still related to the performance of the DHT and relaying.

An update on recent progress:
We are almost ready to merge autorelay support in libp2p: https://github.com/libp2p/go-libp2p/pull/454

#5785 for autorelay integration.
