Zerotierone: ZeroTier can't find shortest path between "mostly" local peers

Created on 23 Aug 2015 · 23 Comments · Source: zerotier/ZeroTierOne

I have an issue where ZeroTier doesn't find the shortest/peer-to-peer path between "mostly" local machines on my setup. A connection is established, but going by the pings it is either going through the relay or, at the very least, taking a roundabout way.

Better description added in comment.

Let me explain my actual set-up first.

  • Our entire apartment is connected via a switch set up by our ISP.

    • This, in turn, connects to the ISP's WAN, which is behind a port-restricted NAT; everyone has the same external IP.

    • All our connections have unique LAN addresses within this local network, and the WAN.

    • Even when the connection between our apartment's network to the ISP is down, whether physically or via any software issues on their local node (it happens often enough for us to be able to tell), our apartment's switch being "dumb" works just fine, and we can access each other's machines if directly connected.

  • I am trying to connect between my desktop called A...

    • Which is behind my home wifi router and connected to it via ethernet.

    • My unique ISP LAN/WAN address goes to this router.

    • The router has a port-forward to my desktop for TCP/UDP port 9993.

  • ...to my neighbor's laptop called B, which is in the same apartment.

    • But this machine B is directly connected to the apartment's switch, so his unique ISP LAN/WAN address goes directly to his laptop, machine B.

  • I have already set-up my personal network via the ZeroTier One administration interface.

In this configuration, machines A and B are able to connect to each other just fine, but have a ping of anything between 200 to 600ms! Clearly, it is not the shortest path, and isn't going over our local network, nor the WAN, where we have a ping of 1 to 4ms.

But, the moment I unplug my WAN cable from my router and plug it directly onto my desktop, machine A, ZeroTier immediately finds the shortest path over our local network, with the usual ping of 1 to 4ms.

This is expected, of course. But unexpectedly, when I plug my WAN cable back into the router right afterwards, returning to the previous config where the connection was going over the relay, it doesn't switch back to that relay, but remains on the short path over our local network just fine! It only breaks and goes back to the relay if we stop sending data over this short connection for a little while; when the connection is then reestablished, it uses the long path.

To recap. My machine A which is behind my router has ports forwarded on the router, but that is not enough for ZeroTier to find the shortest p2p path.

Also, I have tested this set-up by enabling DMZ on my router so that machine A takes its place, to no difference. As long as a connection starts behind a router, it only finds the long path, but when even temporarily connected directly it finds the shortest path and remains on that path even when connected back to the router!

Additionally, I have also tried connecting to another neighbor who uses a router as well. The same thing happens. We BOTH have to connect directly to our apartment's local network once for the path to be established, either one of us doing so doesn't make a difference. But when we do connect directly very briefly and go back to our routers, the short path is retained as long as we are sending data over it!

Is this expected behavior? I was under the impression that as long as the NATs are traversable (and at no point, whether at our homes or the ISP do we encounter a symmetric NAT, indeed, other virtual networks such as Hamachi work fine over our setup,) ZeroTier should find the shortest path. Can anything be done about this?

P.S. All these tests have been done across all combinations of Windows 10 and Linux (Xubuntu 14.04).

All 23 comments

First question I have to ask is what version you're running. If it's before 1.0.4, upgrade.

This week's tests were all done on 1.0.4.

Let me read this back to you and see if I understand:

  • There is one NAT for your apartment, and everyone is behind it. They're connected to a normal switch (no switch-level filtering) and can all see each other.
  • Machine A is behind a second NAT that is plugged into this first switch.
  • Machine B is not behind this second NAT -- it's plugged into the switch.

Can machine A ping machine B's "apartment level" IP address? The only way I can see this working is if A informs B of its "apartment level" address and B can contact it. Otherwise it's going to be tough.

Have you tried turning off local firewalls on these systems to see if that changes things?

  • The NAT for my apartment is the one for our entire WAN which the apartment's switch connects to transparently, but the rest of it is the same.
  • Machine A is behind a second NAT that is plugged into the apartment's switch.
  • Machine B is not behind any additional NATs, it's plugged directly into the apartment's switch.

Machine A can ping machine B using its apartment-level IP address (which is the same as our ISP's WAN-level address), and machine B can also ping machine A using these addresses. The latency of these pings (1ms to 4ms) is what I was using as the reference for the shortest possible path.

All the tests were performed after turning off the Windows firewalls and also the firewall on the router with the additional NAT machine A is behind. But for the combinations with Linux involved I had not turned off Xubuntu's UFW, but I did add a rule to allow 9993 UDP just in case. (It seemed to make no difference. ZeroTier did establish connection with the same characteristics even with UFW on and with or without the rule.)

Sorry for the confusing mess of text. I wasn't sure how to describe it otherwise.

Clarification, machine B can ping machine A only after turning on a setting on the home router (with the additional NAT that machine A is behind,) to allow responding to pings from this home router's WAN (which is the apartment level LAN machine B is directly connected to.) So I assume the router is replying to the pings in this case, and not Machine A directly. That's expected right, since this router is the machine with the apartment-level IP address that machine B knows of?

However, if I set machine A itself as the DMZ target for this home router, so that machine A itself replies to the pings (machine A's firewall can block them if turned on, and the router's setting has no effect), the same remains true.

It's very strange then that they don't find one another.

ZeroTier 1.0.4 will concurrently try three methods for direct connectivity:

  1. It will attempt its classic "transport triggered NAT traversal" in which the root servers send a message to either side telling them what their external address is and they then attempt to connect. This will only work behind conventionally traversable NATs.
  2. It will send other peers (on common networks) a message enumerating its local IP addressing information on its local LAN. Peers will then attempt to contact it there. If that's successful, these links will be used in preference.
  3. It will attempt to use UPnP/NAT-PMP to configure its local router to allow remote packets, and if that's successful it will send these addresses along with local IP info as in mode 2.

So all three of these must be failing in this case.

The only thing I can possibly imagine here is that something is filtering packets between A and B. Does the apartment switch have any filtering logic or is it just a dumb switch?
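One rough way to check whether anything filters UDP between A and B on the apartment LAN is a plain UDP echo test that runs outside ZeroTier. The sketch below is only an illustration under stated assumptions: `udp_echo_once`, `udp_probe` and `loopback_demo` are hypothetical helper names, not ZeroTier tools, and the demo runs both ends on one machine over loopback just to show the round trip.

```python
import socket
import threading


def udp_echo_once(sock):
    """Wait for one datagram on an already-bound socket and echo it back."""
    data, addr = sock.recvfrom(2048)
    sock.sendto(data, addr)


def udp_probe(host, port, payload=b"zt-probe", timeout=2.0):
    """Send one datagram to host:port; return True if the same bytes come back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _ = s.recvfrom(2048)
        except (socket.timeout, ConnectionError):
            # No reply in time, or the OS reported the port as unreachable.
            return False
        return data == payload


def loopback_demo():
    """Run the echo side and the probe side on one machine (hypothetical demo)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    port = server.getsockname()[1]
    t = threading.Thread(target=udp_echo_once, args=(server,), daemon=True)
    t.start()
    ok = udp_probe("127.0.0.1", port)
    t.join(timeout=5)
    server.close()
    return ok
```

For a real check, the echo side would be bound on machine B's apartment-level address and `udp_probe` called from machine A with that address, using a spare UDP port so it does not clash with a running ZeroTier service on 9993. A `False` result only says that no reply arrived in time; it does not say which device dropped the packet.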

It is a dumb switch as far as I can tell. It's on our roof and not even plugged into a power outlet; I assume it gets its power from the ISP-side cable, like some hubs do?

And if I connect machine A directly to the switch (removing the router) the two machines can find each other on the local network (1~4ms) immediately, so the switch must not be filtering even if it's not dumb, right?

And if I connect the router between the switch and machine A _after_ establishing the path like this, that direct short path is retained even behind the router, so it shouldn't be the router filtering things either, or so I think. (The router does have port 9993 forwarded to machine A under it, and also has UPnP turned on.) This confused me even more.

Is there any way for me to log ZeroTier to see when one of the discovery/traversal methods is failing?

[Apologies for late reply.]

I think this diagram may describe our physical network set-up better than my wall of text.

[Image: ISP physical network diagram]

(I'm using multiple named machines to designate the configurations I tested. The names are weirdly ordered because machines A, B and C from my textual description are in their original configurations on the diagram as well.)

Setup:

  • The central ISP NAT as well as all the home router NATs are all traversable.

    • They are all non-symmetric and were tested with relay-less WebRTC PeerConnections as well.

  • All tests were performed over all combinations of Windows 10 and Xubuntu Linux 14.04
  • ZeroTier version tested was 1.0.4

    • Earlier tests were also done on 1.0.3 with same results.

  • Windows Firewall and firewalls on any of the home routers with tested machines behind them were disabled for these tests.
  • The UFW firewall on Xubuntu was not turned off, but the following rule was added: sudo ufw allow 9993/udp.

    • It's worth noting that even without this rule machines H and B were able to find each other fine.

  • The home router that machines A and H are behind is configured to port forward both UDP and TCP traffic on port 9993 to machine A.

Physical Network tests:

  • All of the machines under the same Local Cable Operator Subnet (H, and A to E,) can ping each other with 1 to 4ms latency.

    • In case of home routers, the routers themselves were replying to these pings after turning on that setting, and not machines behind these routers.

ZeroTier virtual network tests:

  • All the machines named can connect to the network, and each other, but many with high enough latency (200 to 600ms) and Internet connection dependency to suggest non-local paths.
  • Machine A and B are NOT connected over their shortest path.
  • Machine A and C are also NOT connected over the shortest path.
  • But machine H and B ARE connected over the shortest path.
  • Machine I and A ARE also connected over the shortest path.
  • Special case:

    • If machine A is physically disconnected from its configuration and momentarily connected directly to the switch, as H and B are, B and the newly positioned A are immediately able to see each other over the shortest path.

    • AND, if machine A is _returned_ to its original position following this short-path discovery, it continues to communicate with machine B over this short local path despite not being able to discover it previously.

    • This persistent short path remains as long as it's used. After the machines are disconnected for a while, it resets, and next time the machines try to connect to each other over ZeroTier while A is behind the router, the short path isn't found.

  • Tests with machine A were also done by setting machine A as the DMZ target on the home router it is behind to no difference in results.

(Psst, what software did you use to generate that image? looks smooth!)

(I just drew it from scratch in Inkscape, and thank you!)

Going to think on this a bit, but also wanted to mention that you can build the service with tracing enabled:

make ZT_DEBUG=1

Then run it in a terminal with:

sudo ./zerotier-one

It will dump a very large amount of tracing output, which may be helpful.

Also going to reference #86 -- an old placeholder ticket that refers to our need for better diagnostics.

Thank you, I will attempt to build with tracing on.

New Tests

I have compiled with debug trace on and have run some tests this weekend.

The test in question was between A and C from the diagram (as B, the machine without a home router, was unavailable then.) And yet more confusingly, I tried the tests with identical configurations (at first,) multiple times a day, with drastically different results!

Starting setup:
  • The only difference in set-up this time was that instead of port-forwarding 9993 TCP and UDP on machine A and C's home routers to the respective devices I port-forwarded 9993 to the broadcast address, thinking that it might be worth a try. (Not having that rule made no difference in all earlier tests.)
  • A was running Xubuntu 14.04, and C was running Windows 10.
  • Firewalls were all off, as before.
Morning Tests:
  • A could immediately ping C with a latency I had never seen in earlier tests, between 8~16ms.

    • This was _not_ the shortest path, but much much faster than previous routes which I assumed were being tunnelled.

    • I can only assume that this time the data is just leaving the ISP WAN and then coming back again, resulting in this small latency that is nonetheless not the shortest, and is subject to external Internet bandwidth limitations and routine packet drops.

  • As soon as and only after _C began pinging A,_ the latency for BOTH ping directions dropped to 1~4 ms.

    • This is certainly the shortest path. Was not subject to Internet bandwidth limits, nor did it have the routine packet drops we face from the ISP level.

  • From that point on, A and C were connected over the shortest path, but:

    • Apps like D-LAN and Kouchat were _not_ able to discover the other machine even on this network.

    • This is weird, as before, even through the long paths, they could easily discover each other, but not anymore!

    • Samba, however, was able to discover everything just fine, and share files at up to physical LAN speeds.

    • Mumble/Murmur was able to connect just fine given the ZeroTier IP.

Changed setup:
  • In the afternoon, I wanted to run this test with A on Windows 10, just in case. But before rebooting, I tried to ping from Linux first anyway.
  • The following test was performed with the two configurations A on Xubuntu 14.04 and C on Windows 10 _and_ both machines on Windows 10!
Afternoon Tests:
  • A could ping C, but it was once again the low but non-shortest path 8~16ms.
  • But when C pinged A nothing changed.

    • Both ping directions remained at 8~16ms, rather than 1~4ms.

    • And it was subject to Internet bandwidth limitations rather than WAN ones.

  • As before, I tried some app tests.

    • D-LAN and Kouchat were not able to see each other.

    • Samba did discover each other, but with slow Internet bandwidths, rather than local speeds.

    • Mumble/Murmur could connect again, but with higher latency, I suppose, although imperceptible.

Changed setup:
  • At this point, I tried changing our router's port-forward settings, even though the setting had not changed since the morning when it worked, but I could think of nothing else, and I had changed nothing else since last week.

    • I tested all six combinations with A and C's routers with the following three port-forward rules:

    • 9993 UDP/TCP to respective PCs.

    • 9993 UDP/TCP to broadcast.

    • No rules at all.

    • None of the configurations had any effect. It remained at 8~16ms.

    • We went back to the broadcast rule, just in case.

  • While I label this next test as the evening one, we actually never stopped testing from the afternoon-setup.

    • So at this point both machines are on Windows 10.

Evening Tests:
  • A and C were both pinging each other and both pings went back up to that long path I noticed for all the tests before this one, anywhere between 200 to 600ms!
  • Special case:

    • As already mentioned, we never stopped pinging each other since the afternoon test, so after about an hour or so the ping characteristics changed by themselves, without us making _any_ changes at all.

    • And we were both connected and talking on Mumble, so we never stopped using the ZeroTier connection as well!

    • This is the first time I've ever noticed such a change happen without us changing anything, or idling and re-creating the path.
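The tests above infer the path from ping latency alone. A minimal sketch of that rule of thumb follows; the bands are simply the ranges reported in this thread, and both the labels and the helper name `classify_rtt` are made up for illustration. Nothing here is ZeroTier behavior.

```python
from statistics import median

# Latency bands (milliseconds) as observed in this thread; the labels are
# the poster's interpretation, not something ZeroTier itself reports.
BANDS = [
    (1, 4, "direct LAN path"),
    (8, 16, "out to the ISP WAN and back"),
    (200, 600, "long path, suspected relay"),
]


def classify_rtt(samples_ms):
    """Map the median of some ping round-trip times onto an observed band."""
    m = median(samples_ms)
    for low, high, label in BANDS:
        if low <= m <= high:
            return label
    return "unclassified"
```

For example, `classify_rtt([9, 12, 15])` falls into the middle band. Using the median keeps a single dropped or delayed ping from changing the verdict, but latency is still only indirect evidence of the path in use.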

Logs

At this point I reboot back into Xubuntu to upload the logs and such, and discover that, even though I was running the debug ZeroTier build since the morning on Linux and thinking that I was logging the output to a file, I actually had no record of the output at all since ZeroTier trace outputs to stderr rather than stdout!
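Since the trace goes to stderr, the log has to capture that stream explicitly. A small sketch of one way to do it from Python is below; `run_with_stderr_log` and `demo` are hypothetical helper names, and the demo uses a throwaway child process that writes one line to stderr instead of the real service.

```python
import os
import subprocess
import sys
import tempfile


def run_with_stderr_log(cmd, log_path):
    """Run cmd and append everything it writes to stderr to log_path."""
    with open(log_path, "ab") as log:
        return subprocess.run(cmd, stderr=log).returncode


def demo():
    """Stand-in for the service: a child process that writes to stderr only."""
    child = [sys.executable, "-c", "import sys; sys.stderr.write('TRACE line\\n')"]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "trace.log")
        run_with_stderr_log(child, path)
        with open(path, "rb") as f:
            return f.read()
```

With the debug build from this thread, the call would look like `run_with_stderr_log(["sudo", "./zerotier-one"], "zt-trace.log")`, where the log file name is arbitrary. In a shell, `sudo ./zerotier-one 2> zt-trace.log` achieves the same thing.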

I then kill zerotier-one and then run it again while logging stderr this time, and leave it running while I type this. None of the setup has changed, A and C right now have 500~600ms latency in both directions, and this is all the log I have tonight.

To recap setup during the log:
  • Same base as this post.
  • But both home routers for machine A and C have port-forwards for 9993 TCP/UDP to the broadcast address instead of themselves.
  • Machine A is on Xubuntu 14.04 running the debug ZeroTier binary versioned 1.0.5
  • Machine C is on Windows 10 running the release binary versioned 1.0.4
Gist with trace output log.

I just realized something. The biggest thing that changed between the afternoon and evening tests was the fact that my neighbor with machine C had idled out his ISP login and had to log back in again, and was assigned a different external IP than before!

We were both on 223.223.131.10 in the morning, which was the last time I looked at this detail, and now only machine A was on 223.223.131.10 and machine C was on 10.42.156.102.

Machines A and C are still on the same local apartment LAN, and were pinging each other constantly over this LAN/WAN on their LAN/WAN addresses on 1~4ms latencies, even for the minute or so machine C was disconnected from the Internet before logging back in with the ISP.

Was thinking about this a little this morning, and two possible explanations come to mind:

  • Some NAT devices map ports by port alone, ignoring the IP. As a result, more than one NAT device on the same port (9993) might mean that only one gets a mapping. This is rare but it exists, hence #228. Perhaps try setting a different port on one end with the -p option.
  • Some NAT devices, apparently for (dubious IMO) security reasons, have logic that actually blocks UDP connectivity if a packet arrives on a given port before one is sent. This is also rare but I've seen it. So if B sends a packet on 9993 to A and this arrives _before_ A sends one to B, A's router/firewall _might_ block connectivity to B for some period of time.
  • Finally, you might try removing the manual port 9993 mapping. This could actually be confusing things. Also maybe try enabling UPnP or NAT-PMP if supported.
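The send-before-receive behavior in the second explanation can be pictured with a toy model. The sketch below is an assumption-heavy illustration, not the logic of any real router: `PortRestrictedNat` and `PenalizingNat` are invented names, the addresses are documentation placeholders, and the block never expires here, whereas the comment above only says it might last "for some period of time".

```python
class PortRestrictedNat:
    """Toy filter: inbound UDP is accepted only from an (ip, port) pair
    that the inside host has already sent a packet to."""

    def __init__(self):
        self.allowed = set()

    def outbound(self, remote_ip, remote_port):
        # Sending out opens the return path for exactly this remote endpoint.
        self.allowed.add((remote_ip, remote_port))

    def inbound_ok(self, remote_ip, remote_port):
        return (remote_ip, remote_port) in self.allowed


class PenalizingNat(PortRestrictedNat):
    """Variant: a packet that arrives before any outbound one gets its
    sender blocked (no timeout is modeled)."""

    def __init__(self):
        super().__init__()
        self.blocked = set()

    def inbound_ok(self, remote_ip, remote_port):
        key = (remote_ip, remote_port)
        if key in self.blocked:
            return False
        if key not in self.allowed:
            self.blocked.add(key)  # unsolicited packet: remember and refuse
            return False
        return True


PEER_B = ("192.0.2.20", 9993)  # placeholder address for the other peer
```

In this model a plain port-restricted filter starts accepting B's packets as soon as A has sent one to B, whichever side went first. The penalizing variant keeps refusing B if B's packet arrived first, so the outcome depends on packet ordering, which would be one way to get results that differ between otherwise identical test runs.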

Assuming we're talking about the home router NATs:

  • For the devices tested so far the home routers have only had a single mapping on 9993 to a single device at a time (other than the alternative tests I did by mapping to the entire broadcast address.)
  • This is something I also now suspect seeing as how in some configs with home routers it takes _both_ parties to ping each other before the shortest path is found, (although I'm not certain if this behavior would have anything to do with _this_.) I couldn't find any documentation clarifying whether the NAT for both of the Asus home routers tested is of this send-to-receive kind. This is something that I can only test for sure once one of us gets around to flashing DD-WRT on the routers or finding another router that is guaranteed not to have this behavior. Looking at both.
  • Will try testing by removing the mapping altogether. The routers already have UPnP turned on, and there is no NAT-PMP setting on either of the tested routers.

Will also try testing with the Android beta that released today. It might make the combinations more complicated, but I'd have a quicker time putting my portable devices on my neighbors' home WiFis than waiting for the weekends!

Ran some tests with the Android beta. I realize this may not be the appropriate thread to post about the Android beta, but it is somewhat relevant, so I went ahead.

All tests failed to find the shortest path, even over my home WiFi network.

Base Setup:

  • Overriding the earlier diagram and common setup.
  • Machine A is the desktop running Windows 10.
  • Machine I is now my phone.
  • Both of the machines are under my home WiFi network.
  • Both machines were constantly pinging each other.

Test 1:

  • No port forwards on the router for any of the machines.
  • Failed to find shortest path.

    • Ping over 60ms (which I now understand is somewhat brief because this is the normal "external" path when both devices have the same external ISP endpoint.)

    • For a period of time after switching configs it reaches the 200~600ms ping that I suspected was the relay, then finds the 60ms path.

    • Samba shares were tested to be copying over the Internet, limited by Internet bandwidth.

Test 2:

  • TCP+UDP 9993 broadcast forwarded over my home WiFi network.
  • Failed in the same way.

Test 3:

  • TCP+UDP 9993 on the router is forwarded to 9993 on A.
  • TCP+UDP 9994 on the router is forwarded to 9993 on I.
  • Failed in the same way, still!

This is the first time two machines on the same nearest local network have failed to find the shortest path to each other.

When _I_ has been a desktop, which is to say when two machines were on the same home WiFi they have _always_ immediately found the shortest path, even without port-forwards (before I started testing.) Not so for the Android device.

I _only_ just realized that the active data-paths I've been testing by ping are actually clearly listed by the program! :disappointed:

At this moment, the Windows client displays the following paths to my Android phone on the same WiFi network:

  • 192.168.1.110/9993 466/598

    • This is the phone on my home WiFi. It can be seen, but is not being used for some reason!

  • 103.44.173.166/1024 2072/2210

    • Both my phone and my desktop's external Internet address is 10.44.159.140 atm. I don't really know what this address is.

  • 103.44.173.166/9993 0/

    • Likewise. But this one is being used!

I will record all the paths from now on.
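When recording paths like the ones listed above, it can help to mark which endpoints sit in private address space. A minimal sketch, assuming only the `ADDRESS/PORT` prefix of each line is parsed: `parse_path` is a hypothetical helper, and the trailing number pairs are left alone because the thread does not say what they mean.

```python
import ipaddress

# Path lines exactly as shown by the Windows client above.
PATHS = [
    "192.168.1.110/9993 466/598",
    "103.44.173.166/1024 2072/2210",
    "103.44.173.166/9993 0/",
]


def parse_path(line):
    """Return (ip, port, is_private) for a line starting with ADDRESS/PORT."""
    endpoint = line.split()[0]
    addr, port = endpoint.rsplit("/", 1)
    ip = ipaddress.ip_address(addr)
    return ip, int(port), ip.is_private


def summarize(lines):
    """Label each path as 'private' or 'public' by its address range."""
    out = []
    for line in lines:
        ip, port, private = parse_path(line)
        out.append((str(ip), port, "private" if private else "public"))
    return out
```

Run on the three lines above, this marks 192.168.1.110 as private and 103.44.173.166 as public. The same check reports 10.44.159.140 as private (it lies in 10.0.0.0/8), which is at least consistent with the ISP-level NAT described at the start of the thread.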

Final addition to today's home WiFi tests: I've let the devices ping each other for almost an hour now.

The only path that the Windows client on my desktop to my phone now shows is the home WiFi one:

  • 192.168.1.110/9993 0/0

BUT, and this has now confounded me utterly, the pings are still above 60ms. AND tests such as Samba file-share etc. are _not_ working locally but are going over the outside, being limited by my Internet bandwidth rather than my WiFi's bandwidth.

(For comparison, I often copy files between my phone and desktop over WiFi. It's slower than a PC at this, but I still get around 5Mbps normally, which is a lot more than my 800Kbps Internet connection.)

Additionally, right now I am also pinging my neighbor's laptop which is currently also on my own WiFi network just like the phone is. It is pinging over the shortest path (1~2ms) and has copied files over Samba up to my WiFi's bandwidth just fine.

If Adam can solve this, he deserves a reward.

He deserves a reward anyway, that is why I got a subscription even tho I don't need the private network feature.

I am closing this for two reasons.

  1. As we grow we intend to add a lab where we create insane pathological configurations and then try to make them work. By insane and pathological I am including anything with more than one layer of NAT as well as overlapping conflicting IP address ranges and other horrors sometimes seen in the wild.
  2. Until then, we're going to have to triage _some_ edge cases. We may also implement #168 or some version of it which would allow workarounds for cases like this.