The promise of ZeroTier, as I understand it, is to provide a virtual "flat" network over the mess of NAT, firewalls, VPNs and shifting IPs. Bytes go into zt0, magic P2P NAT-traversing stuff happens, bytes pop out of another zt0 and everyone's happy.
The thing is, the whole networking mess is still there underneath, and when it breaks, ZeroTier leaves you up a creek without a paddle. One is unable to debug further, because there are no logs and no debugging tools.
In my case, I have ZeroTier linking two servers in my LAN (plus my phone). Every now and then zt0 just stops working for around a minute, before resuming. During these outages I can ping and SSH to the other server, so it's not a general TCP problem. Running zerotier-cli on each end tells me nothing: the list of peers and networks is unchanged during an outage.
I'd like to use ZeroTier in production, but can't until it gives me at least a fighting chance of diagnosing problems.
I understand there is some 'circuit testing' code written, but I couldn't find an issue or spec, so thought I'd write up my use-case here. What I'd like to see is a rough equivalent of 'mtr' (traceroute) output, showing hops from here to there. E.g.:
# zerotier-cli traceroute anotherhost.zt
Start: Wed Aug 17 12:56:39 2016
HOST: myhost.zt                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- gateway                    0.0%     4    0.5   0.5   0.5   0.6   0.0
  2.|-- lo0.lns20.syd4.on.ii.net   0.0%     4   12.6  12.0  11.5  12.6   0.0
  3.|-- xe-11-2-1.cr1.syd4.on.ii.  0.0%     4   12.7  16.1  11.9  27.8   7.7
  4.|-- ae5.br1.syd4.on.ii.net     0.0%     4   12.0  12.7  11.9  15.0   1.4
  5.|-- 72.14.221.174              0.0%     4   11.9  12.1  11.5  12.5   0.0
  6.|-- 216.239.41.77              0.0%     4   12.5  12.4  11.8  12.9   0.0
  7.|-- 209.85.244.15              0.0%     4   12.0  12.1  11.9  12.3   0.0
  8.|-- anotherhost.zt             0.0%     4   12.0  12.1  12.0  12.2   0.0
(Perhaps with a '-n' option to suppress mapping of device IDs to hostnames)
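For anyone prototyping such a report, here is a rough Python sketch of how the per-hop columns (Loss%, Snt, Last, Avg, Best, Wrst, StDev) could be computed from raw probe results. The function name `hop_summary` is hypothetical, and the use of a population standard deviation is an assumption on my part; it is not taken from mtr or ZeroTier source.

```python
import statistics
from typing import Optional, Sequence


def hop_summary(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarise the probes sent to one hop; None marks a lost probe."""
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    summary = {"loss_pct": loss_pct, "sent": sent, "last": None,
               "avg": None, "best": None, "worst": None, "stdev": None}
    if replies:
        summary.update(
            last=replies[-1],                  # most recent reply seen
            avg=statistics.fmean(replies),
            best=min(replies),
            worst=max(replies),
            # Assumption: population standard deviation of the replies.
            stdev=statistics.pstdev(replies),
        )
    return summary
```

A hop that never answers comes back with 100% loss and `None` timings, which is the case where the report would need to say why.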
Most importantly, if a connection cannot be established, there should be some indication of why: what routes were attempted and at what hop did they fail.
Another idea: since zt0 works as a sort of "virtual ethernet" like VxLAN, could it log unencrypted packets in 'pcap' format, so that zt0 traffic can be analyzed by tools like Wireshark, and reconstituted back into TCP and higher layers? A 'pcap dumper' feature might enable users to self-debug all sorts of weird problems that are otherwise reported as zerotier bugs.
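To show how little a 'pcap dumper' needs on the file-format side, here is a minimal Python sketch that writes Ethernet frames in the classic pcap layout (24-byte global header, 16-byte per-packet header). The function names and the source of the frames are hypothetical; how ZeroTier would hand decrypted frames to such a writer is not something this sketch knows about.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4     # classic pcap, microsecond timestamps
LINKTYPE_ETHERNET = 1       # frames start with an Ethernet header


def pcap_global_header(snaplen: int = 65535) -> bytes:
    """File header: magic, version 2.4, thiszone, sigfigs, snaplen, linktype."""
    return struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0,
                       snaplen, LINKTYPE_ETHERNET)


def pcap_record(frame: bytes, timestamp: float) -> bytes:
    """Per-packet header (seconds, microseconds, captured and original length) plus the frame."""
    sec = int(timestamp)
    usec = int((timestamp - sec) * 1_000_000)
    return struct.pack("<IIII", sec, usec, len(frame), len(frame)) + frame


def dump_frames(path, frames):
    """Write an iterable of (timestamp, frame_bytes) pairs to a pcap file."""
    with open(path, "wb") as f:
        f.write(pcap_global_header())
        for timestamp, frame in frames:
            f.write(pcap_record(frame, timestamp))
```

A file written this way should open directly in Wireshark as ordinary Ethernet traffic.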
ZeroTier in fact has two... uhh... layers. Can't call them tiers.
These are VL1 and VL2. VL1 is the P2P network and is our 'virtual wire.' VL2 is a lot like VXLAN and runs over VL1 and uses VL1 to talk to its SDN controllers.
I've been asked many times why ZT has no logging. Part of it is our feature-minimal philosophy, but part of it is also that I'm not sure logging is the best diagnostic tool. I think something like a VL1 version of tcpdump (ztdump?) would be better. We might build that, so I retitled this.
There is also a facility in the ZeroTier protocol to allow network controllers to do tests remotely. It's called circuit testing. We've been gathering some data with this over some networks for a while but so far we have no visualization or UI of any kind for this. This is also on the queue.
Circuit tests are cool. The controller can send a source routed probe that traverses an arbitrary graph of nodes in a VL2 virtual network, reporting back at each hop. (It's crypto auth'd and only the controller for a network you have joined can do it, and only among members of that net, etc.) It provides a lot of data including physical paths, latencies, direct vs. indirect, etc.
A ztdump would be excellent, as it allows the motivated to debug as far as they want. At least in theory - in practice it might be hard to trace from VL1 packets to application-level problems. Wireshark has a "Decode As: VxLAN" feature: we'd kinda need an equivalent "Decode As: ZeroTier VL2". Also, the new PcapNG format has metadata -- setting decent metadata could help users make the VL1 -> VL2 -> application mental translations.
A concrete test might be: pull out the ethernet port in the middle of a TCP connection. From just a PCAP file, how easy is it to tell what happened?
It would be nice if ztdump could rotate capture files, like tcpdump's -W and -C, so that intermittent failures could be retroactively debugged by capturing a rotating log of the last day's traffic.
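Until something like ztdump exists, stock tcpdump on the zt0 interface can already keep a ring of capture files. Below is a small Python sketch that builds such a command line; the helper name, file prefix and default sizes are illustrative assumptions, and whether the ring actually covers a day depends on traffic volume. It only sees the virtual-ethernet side of zt0, not VL1.

```python
def rotating_capture_argv(interface="zt0", prefix="zt0-capture.pcap",
                          file_size_mb=100, file_count=10):
    """Build a tcpdump command that keeps a fixed-size ring of capture files."""
    if file_size_mb < 1 or file_count < 1:
        raise ValueError("file_size_mb and file_count must be at least 1")
    return [
        "tcpdump",
        "-i", interface,          # capture on the ZeroTier interface
        "-n",                     # do not resolve addresses to names
        "-w", prefix,             # write raw packets; tcpdump appends a file number
        "-C", str(file_size_mb),  # start a new file after this many million bytes
        "-W", str(file_count),    # keep this many files, overwriting the oldest
    ]
```

The result can be passed to `subprocess.run(argv, check=True)`; capturing normally requires root.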
Barring the implementation of this ticket, is there any way right now to investigate why two nodes can't create a link between each other?
Right now, most of the time ZeroTier just works, which is awesome. But sometimes it doesn't, and when it doesn't, it is very hard to find out why. It is even hard to find out if UPnP is working correctly.
I'm having a hard time debugging and am not sure what I should be using if not tcpdump/traceroute. What should I be using?
@WyseNynja Both traceroute and tcpdump are good. For instance, troubleshooting your docker 6plane setup, I get the following traceroute on a successful connect between containers on different hosts:
/ # traceroute6 fc7b:59ab:4811:901c:40ea::2
traceroute to fc7b:59ab:4811:901c:40ea::2 (fc7b:59ab:4811:901c:40ea::2), 30 hops max, 72 byte packets
1 fc7b:59ab:48c0:b0f4:74a5::1 (fc7b:59ab:48c0:b0f4:74a5::1) 0.011 ms 0.008 ms 0.005 ms
2 fc7b:59ab:4811:901c:40ea::1 (fc7b:59ab:4811:901c:40ea::1) 14.245 ms 14.328 ms 14.228 ms
3 fc7b:59ab:4811:901c:40ea::2 (fc7b:59ab:4811:901c:40ea::2) 14.607 ms 14.591 ms 15.435 ms
/ #
Here you can see the 6plane routing. If it stopped after the 1st hop, I would run tcpdump on the local host's container / bridge / zt interfaces to see how far the packets get. If it stopped after the 2nd hop, I would start with tcpdump on the remote host's interfaces involved.
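The same triage can be scripted. This Python sketch finds the last hop that answered in traceroute output shaped like the above and suggests which side to capture on. The function names are hypothetical, the three-hop total matches this particular 6plane setup only, and treating "no hop answered" like a first-hop failure is my assumption.

```python
import re

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
RTT = re.compile(r"\d+(?:\.\d+)?\s*ms")


def last_responding_hop(traceroute_output: str) -> int:
    """Return the highest hop number that reported at least one round-trip time."""
    last = 0
    for line in traceroute_output.splitlines():
        match = HOP_LINE.match(line)
        if match and RTT.search(match.group(2)):
            last = int(match.group(1))
    return last


def where_to_capture(last_hop: int, total_hops: int = 3) -> str:
    """Map the last answering hop to the side where tcpdump should start."""
    if last_hop >= total_hops:
        return "path complete"
    if last_hop <= 1:
        # Assumption: no reply at all is handled like a first-hop failure.
        return "local host: container, bridge and zt interfaces"
    return "remote host: zt, bridge and container interfaces"
```

Hops that time out (shown as `* * *`) carry no `ms` value, so they do not advance the result.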
Closing since remote tracing is going to address this, and more.