ZeroTierOne: ZeroTier has difficulty communicating with planets

Created on 5 Jun 2020 · 37 Comments · Source: zerotier/ZeroTierOne

We've been testing ZeroTier for some weeks now and our network was working well till this last Tuesday 2nd of June 2020. That day we were connected through remote desktop to some of the machines when suddenly the connection through zerotier failed.

The ZeroTier network includes some machines inside a big local network and some domestic machines. The big local network is not managed by us, so we have no real control over it or over what is filtered, and we cannot check its logs. But we can contact the people in charge of it, if we know what to ask for.

Until last Tuesday, all the machines in the ZeroTier network were reachable through remote desktop. On Tuesday, most of the machines on the big local network failed, while the domestic machines still seem to be accessible. We say most of the machines, not all of them: 2 or 3 machines inside the big local network are still accessible through the ZeroTier network, and others come online for a few minutes and then go offline again.

We don't know where to start looking or what kind of information to collect to investigate the problem.

As far as we have seen, the machines that are not working on the ZeroTier network give this output when executing: zerotier-cli peers | grep PLANET

34e0a5e174 -      PLANET    -1 RELAY
3a46f1bf30 -      PLANET    -1 RELAY
778cde7190 -      PLANET    -1 RELAY
992fcf1db7 -      PLANET    -1 RELAY

In this case, ZeroTier seems unable to connect to any of the planets. We guess this is needed for it to work properly. Some of the problematic machines connect to only one or two planets, but not to all four.
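A quick way to compare machines is to count how many PLANET rows show `-1 RELAY`, as in the output above. The following is only a minimal sketch: `count_unreachable_planets` is a made-up helper name, and it assumes the column layout shown above (address, version, role, latency, link).

```shell
#!/bin/bash
# Count PLANET peers listed as "-1 RELAY" in `zerotier-cli peers` output read from stdin.
# Assumed column order: address, version, role, latency, link.
count_unreachable_planets() {
    awk '$3 == "PLANET" && $4 == "-1" && $5 == "RELAY" { n++ } END { print n + 0 }'
}

# Usage (on a machine with ZeroTier installed):
#   zerotier-cli peers | count_unreachable_planets
```

A result equal to the number of planets listed means the machine currently has no direct path to any of them.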

Those machines that cannot connect to the planets give this output when executing: zerotier-cli listnetworks

200 listnetworks <nwid> <name> <mac> <status> <type> <dev> <ZT assigned ips>
200 listnetworks XXXXX XXXXX XXXXXX OK PRIVATE ztmjfnnfbw XXXXXXX/24

or this: zerotier-cli status
200 info 585d489748 1.4.6 TUNNELED

This apparently says the ZeroTier client is connected to the network, but then:
zerotier-cli info
200 info 585d489748 1.4.6 OFFLINE

These machines also appear offline in ZeroTier Central.

We've verified that these machines have port 9993 open (we tested it with the telnet command from inside the local network, and from outside when the machine has a public IP), so it seems this port is not blocked. All the machines have access to the Internet and can ping the PLANET IPs. We've also tried restarting the ZeroTier service on them, and the machines themselves. We've tested them without any firewall as well, to rule that out. But the problem remains.

Are there any other tests we could run to get more information and see where the problem lies? Does the ZeroTier client have any place to look for logs? How can we force the ZeroTier client to reconnect to its planets (apart from restarting the service, which does not seem to work)? What could be blocking access to them?

Any help will be much appreciated.
Thanks!

bug duplicate management networking & routing

All 37 comments

That's pretty strange. It does look like a firewall/networking issue but you've checked that.

I looked up your network and see it has an interesting setup (with the routes and rules).

Were any of the config changes made on that day possibly?
Do any of your zerotier subnets overlap with any of your physical subnets?
Did a machine get cloned, and now multiple machines have the same ID?

Tangent:
There may be simpler ways to segment your network... either by just using multiple zerotier networks, or by using tags
https://www.zerotier.com/manual/#3_5_4

Hi laduke, thanks for answering!

We did not make any changes that day, nor did we add new nodes or clone any machine. We lost the ZeroTier connection from one moment to the next, without any apparent reason.

We think there might be a problem with the big local network (the one we have no control over). They might have changed something that is affecting most of the machines with the ZeroTier VPN, but not all of them. We would like to know what kind of tests we could run to get evidence of that, so we can talk to the people who manage that local network knowing what to ask for.

What kind of tests could we run on the machines that are still working versus the ones that cannot connect to the root servers/planets?

Best regards

Hi again,

Just to rule out a few more things, we have:

  • Created a new ZeroTier network with another user, so it is completely separate from the one that is malfunctioning.
  • Tested the new ZT network with a domestic machine and checked that it joins without problems, so we can authorize it from my.zerotier.com. We wanted to rule out that the problem is related to the IDs in the previous network. This new network is pretty simple and has no segmentation.
  • Selected one of the machines in the big local network that is not currently working. Completely uninstalled the ZeroTier client and also removed the configuration files located in ProgramData/ZeroTier (this is a Windows 10 machine).
  • Installed the ZeroTier client again on this machine, and it was given the id fcf1dc25ed. We tried to join the new network we created before and found that the ZT client is not able to connect to it. The reported status is: REQUESTING_CONFIGURATION PRIVATE ethernet_32775

The request never shows up in my.zerotier.com, so we cannot authorize the machine, and if we check _zerotier-cli peers_ we get:

34e0a5e174 -      PLANET    -1 RELAY
3a46f1bf30 -      PLANET    -1 RELAY
992fcf1db7 -      PLANET    -1 RELAY
de8950a8b2 -      PLANET    -1 RELAY

It seems the ZT client still cannot contact ZeroTier Central or the planets/root servers, even though this is a new network and we have reinstalled everything. ZT has also automatically modified the Windows firewall to allow communications from the ZeroTier client, and we have checked that ZT is running. We can also see that this machine has created a new network adapter for ZT with an automatically assigned IP address 169.254.x.x, which is not a valid IP address for the ZT network we created (it should be 192.168.195.x).

I have this issue too: yesterday I lost one machine this way, and it didn't work again until I uninstalled ZeroTier, reinstalled and reconfigured it. The GUI says connected and the client appears as ONLINE in the "myzerotier" dashboard too, while the cli command "peers" at first showed that machine as RELAY; after a while that node disappeared from the list and the cli command "status" reported OFFLINE. In my case, rebooting or disconnecting/reconnecting doesn't fix the issue: I need to uninstall/reinstall, as I said before. This is explained in #1146 too.

In our case the problem still remains even when we uninstall/reinstall the ZeroTier client.
We have no clue why it is happening on some machines while others in the same local network have become available again on their own, without any change being made. It is really strange :/

This problem has been happening more frequently lately. Can anyone enlighten us about this? Thank you!

We finally got some traces from both a working and a problematic machine using Wireshark.

On the machine where ZeroTier was working, we could see packets being exchanged between the machine and the IPs of the ZeroTier planets we have been assigned. On the machine where it was not working, there was no exchange: the machine sent packets to the planets, but the planets never answered.

With these traces we were able to talk to the people responsible for the big local network where the problem was located, and it seems they have changed something, so now all machines in that network are available again through ZeroTier. We are still seeing intermittent connections with some of them, but we are not sure whether that is an exceptional case due to network issues that day. We will keep testing a little more.
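For anyone who wants to repeat the working-versus-broken comparison described above, a capture filter limited to UDP traffic to and from the planet addresses keeps the trace small. This is only a sketch: `zt_capture_filter` is a hypothetical helper, not part of ZeroTier, and the addresses in the usage comment are the planet paths that appear in the peer dump later in this thread; substitute whatever your own `zerotier-cli peers` reports.

```shell
#!/bin/bash
# Build a tcpdump/Wireshark capture filter for UDP traffic to or from the given hosts.
# With no arguments, fall back to ZeroTier's default UDP port 9993.
zt_capture_filter() {
    local filter="" ip
    if [ "$#" -eq 0 ]; then
        printf 'udp port 9993\n'
        return 0
    fi
    for ip in "$@"; do
        if [ -z "$filter" ]; then
            filter="host $ip"
        else
            filter="$filter or host $ip"
        fi
    done
    printf 'udp and (%s)\n' "$filter"
}

# Usage (tcpdump needs root; run it on both a working and a failing machine):
#   tcpdump -ni any "$(zt_capture_filter 185.180.13.82 50.7.252.138)"
```

If the failing machine only shows outgoing packets and never a reply, that is the same symptom we reported to the network administrators.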

No luck: after every reset it works for around two or three days, then some machines can no longer connect properly to the PLANET servers and get the "RELAY" peer connection status. The "zerotier-cli status" command shows them as offline too, while the web dashboard sees them as online. Sadly, this problem is becoming very annoying...

Hey, we're watching this thread, but are not seeing this anywhere on our own stuff.

Could everyone summarize their environment where they are seeing this? Hosting provider/hardware/ISP type/Linux Distro/ZeroTier version...

@laduke, thanks for answering.

In my scenario, ZeroTier is the latest version, 1.4.6, installed everywhere. I have 3 machines with Windows 10 2004, 1 Raspberry Pi 4 running Raspberry Pi OS 32-bit (Raspbian), and 1 Android device. I must say that this issue has usually happened randomly with the Windows clients; it has never happened on my Raspberry so far. Windows Server 2016 seems to suffer from the same thing randomly too. I have a residential connection with a national ISP in two different locations. Right now my dashboard sees every machine online, but two of them can't ping each other and the "zerotier-cli status" command on them shows OFFLINE.

DIAGRAM:

ISP LOCATION X:
CLIENT A: WINDOWS 10 2004
CLIENT B: RASPBERRY

ISP LOCATION Y:
CLIENT C: WINDOWS 10 2004
CLIENT D: WINDOWS 10 2004

CLIENTS A-B and C-D are respectively on the same physical LAN. In location X or Y I can ping, through the ZeroTier network, all the machines which are physically on the same LAN, but I cannot ping the machines in the other location because the Windows clients are in RELAY status/offline. As I already described, if I reset the clients' identities and re-allow them in the dashboard, everything starts working again for two or three days, then the issue comes back.

I've tried disabling and re-enabling the virtual network card, stopping and starting the ZeroTier service, and leaving and rejoining the network: none of those worked. The only way to make it work again is to reset and reauthorize the identity of the offline clients.

I am also encountering something quite similar on one particular PC (Debian, bare metal, NAT behind a static IP) that loses connectivity.

This PC was working fine for months and has only started showing this behavior in the last 15 days or so.
Connectivity within the ZeroTier network fails: pinging other clients in the same network doesn't work. My.zerotier shows it online, but the status output on the PC says offline. Leaving/rejoining the network doesn't help.

What works for me is stopping the ZeroTier service for 10-15 seconds and then starting it again. Just restarting the service does not work: I must stop the service for 10-15 seconds and then start it again. Then it finally goes online for some period of time, network connectivity resumes and all is well. Until it decides to do that again some indeterminate time later...

And here we are: after 5 days of activity, OFFLINE again... I must say that I just tried @abclution's trick and it worked: I stopped the ZeroTier service for 5 min, then started it again, and now the client is online again. But it's a temporary solution, because it will randomly go offline again anytime soon...

Today the issue happened again. Until a proper fix eventually comes out, I made a little script which checks ZeroTier's online status and, when it is offline, stops the ZeroTier service, waits 10 min, then starts it again. This is the script for Windows cmd:

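@echo off
setlocal EnableDelayedExpansion
rem Clear any inherited value so a stale "var" cannot mask an offline node.
set "var="
rem Token 5 of a line like "200 info ID VERSION ONLINE" is the state; findstr only passes ONLINE lines.
FOR /F "tokens=5 delims= " %%A IN ('"zerotier-cli status | findstr "ONLINE""') DO set var=%%A
if NOT "%var%" == "ONLINE" (
goto RESTART
)
exit /B

:RESTART
rem Stop the service, wait 10 minutes (600 s), then start it again.
sc stop zerotieroneservice
timeout /t 600 >nul
sc start zerotieroneservice
exit /B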

Then create a scheduled task in Task Scheduler that runs this script with Admin privileges every hour or so.

@graphixillusion Glad to hear this works for you. Yes, it is a very strange issue, and has appeared pretty recently but it is PERSISTENTLY an issue for me. Every ~12 to 16 hours I have to do this. Really sad.

Nice script, I will keep it saved for any Windows boxes I encounter with this error.

I need to write a similar script for linux...

According to the ZT documents, this should not be necessary. The ZT client is supposed to NEVER give up trying to connect, but obviously it is just giving up once it hits this odd OFFLINE status. The lack of logging makes this impossible to understand better.

@abclution I'm still testing this script on just one box; let's see if it does the job while the others go offline. Yes, it's a very strange issue indeed: in my case it occurs every 3 days, more or less. I really hope we'll get a proper fix sooner rather than later...

For anyone encountering this issue on Linux, here is an extremely simple script that can be put in cron. You can schedule it as often as you like.

#!/bin/bash
# Stop and start ZeroTier if zerotier-cli reports OFFLINE; intended to run from cron.
SERVICECMD="/usr/sbin/service"
SERVICENAME="zerotier-one"
ZTCLI="/usr/sbin/zerotier-cli"
# Seconds to keep the service stopped before starting it again
ZTSLEEP=15
ISOFFLINE=$("$ZTCLI" status | grep OFFLINE)

if [[ -z "$ISOFFLINE" ]]; then
   echo "ZEROTIER is not in an offline status."
else
   echo "$ISOFFLINE"
   echo "STOPPING SERVICE FOR $ZTSLEEP SECONDS"
   "$SERVICECMD" "$SERVICENAME" stop
   sleep "$ZTSLEEP"
   echo "STARTING SERVICE"
   "$SERVICECMD" "$SERVICENAME" start
   echo "ZEROTIER RESTARTED"
fi

Nice.

Hey @abclution, can you capture some info when it's in the offline state, before restarting?

zerotier-cli peers -j > $(date -d "today" +"%Y%m%d%H%M").log 

If you don't want to paste it here, you can email [email protected] or DM me, @zt-travis, on the my.zerotier.com community.

Hey sure, yea I already updated my personal script for some logging and failure times.
@laduke Will update when I know more.
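For others who want to collect the same data, here is a minimal sketch of that kind of logging. It is not the personal script mentioned above; `dump_peers_if_offline`, the `LOGDIR` path and the `ZTCLI` default are assumptions to adjust for your system. It only saves a timestamped `peers -j` dump when the status line contains OFFLINE.

```shell
#!/bin/bash
# Sketch: when the node reports OFFLINE, save a timestamped peer dump before any restart.
# ZTCLI and LOGDIR are assumed defaults; override them in the environment if needed.
ZTCLI="${ZTCLI:-/usr/sbin/zerotier-cli}"
LOGDIR="${LOGDIR:-/var/log/zt-offline}"

dump_peers_if_offline() {
    # Nothing to do unless the status line contains OFFLINE.
    "$ZTCLI" status | grep -q OFFLINE || return 0
    mkdir -p "$LOGDIR" || return 1
    # Same file naming as the command above: YYYYMMDDHHMM.log
    "$ZTCLI" peers -j > "$LOGDIR/$(date +%Y%m%d%H%M).log"
}

# Usage: call dump_peers_if_offline from a script that cron runs every minute,
# before any stop/start workaround, so the dump reflects the stuck state.
```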

@laduke Small update

So, 2 days have gone by, and I have been checking the online status minute by minute and logging the situation.
So far the issue has occurred at exactly the same time each day, and for exactly the same amount of time.

In the log directory listing below, ZT is in an OFFLINE state from exactly 4:40 to 5:02 on both days (those are the peer dumps you asked for). This leads me to believe there is a local job issue, so I investigated further.

-rw-r--r--  1 root root 5044 Jul  3 04:40 202007030440.log
-rw-r--r--  1 root root 1230 Jul  3 04:41 202007030441.log
-rw-r--r--  1 root root 2392 Jul  3 04:42 202007030442.log
-rw-r--r--  1 root root 1230 Jul  3 04:43 202007030443.log
-rw-r--r--  1 root root 2392 Jul  3 04:44 202007030444.log
-rw-r--r--  1 root root 1230 Jul  3 04:45 202007030445.log
-rw-r--r--  1 root root 2392 Jul  3 04:46 202007030446.log
-rw-r--r--  1 root root 1230 Jul  3 04:47 202007030447.log
-rw-r--r--  1 root root 2391 Jul  3 04:48 202007030448.log
-rw-r--r--  1 root root 1230 Jul  3 04:49 202007030449.log
-rw-r--r--  1 root root 2392 Jul  3 04:50 202007030450.log
-rw-r--r--  1 root root 1230 Jul  3 04:51 202007030451.log
-rw-r--r--  1 root root 2391 Jul  3 04:52 202007030452.log
-rw-r--r--  1 root root 1229 Jul  3 04:53 202007030453.log
-rw-r--r--  1 root root 2392 Jul  3 04:54 202007030454.log
-rw-r--r--  1 root root 1229 Jul  3 04:55 202007030455.log
-rw-r--r--  1 root root 2391 Jul  3 04:56 202007030456.log
-rw-r--r--  1 root root 1230 Jul  3 04:57 202007030457.log
-rw-r--r--  1 root root 2391 Jul  3 04:58 202007030458.log
-rw-r--r--  1 root root 1229 Jul  3 04:59 202007030459.log
-rw-r--r--  1 root root 2392 Jul  3 05:00 202007030500.log
-rw-r--r--  1 root root 1229 Jul  3 05:01 202007030501.log
-rw-r--r--  1 root root 2391 Jul  3 05:02 202007030502.log
-rw-r--r--  1 root root 4841 Jul  4 04:40 202007040440.log
-rw-r--r--  1 root root 1230 Jul  4 04:42 202007040442.log
-rw-r--r--  1 root root 1229 Jul  4 04:43 202007040443.log
-rw-r--r--  1 root root 2392 Jul  4 04:44 202007040444.log
-rw-r--r--  1 root root 1230 Jul  4 04:45 202007040445.log
-rw-r--r--  1 root root 2391 Jul  4 04:46 202007040446.log
-rw-r--r--  1 root root 1229 Jul  4 04:47 202007040447.log
-rw-r--r--  1 root root 2392 Jul  4 04:48 202007040448.log
-rw-r--r--  1 root root 1229 Jul  4 04:49 202007040449.log
-rw-r--r--  1 root root 2392 Jul  4 04:50 202007040450.log
-rw-r--r--  1 root root 1230 Jul  4 04:51 202007040451.log
-rw-r--r--  1 root root 2392 Jul  4 04:52 202007040452.log
-rw-r--r--  1 root root 1230 Jul  4 04:53 202007040453.log
-rw-r--r--  1 root root 2391 Jul  4 04:54 202007040454.log
-rw-r--r--  1 root root 1229 Jul  4 04:55 202007040455.log
-rw-r--r--  1 root root 2392 Jul  4 04:56 202007040456.log
-rw-r--r--  1 root root 1230 Jul  4 04:57 202007040457.log
-rw-r--r--  1 root root 2392 Jul  4 04:58 202007040458.log
-rw-r--r--  1 root root 1229 Jul  4 04:59 202007040459.log
-rw-r--r--  1 root root 2392 Jul  4 05:00 202007040500.log
-rw-r--r--  1 root root 1230 Jul  4 05:01 202007040501.log
-rw-r--r--  1 root root 2391 Jul  4 05:02 202007040502.log

So I started poking at my cron jobs to see what was going on, and it looks like a misconfigured job may be breaking networking for approximately 20 minutes. I have taken steps to fix this and will see what happens afterwards.

What is odd, though, is that ZT still does not attempt to reconnect after the 20 minutes of downtime. ZT just stays stuck offline until I stop the service for 10-15 seconds and then start it again.

I was under the impression from the docs that ZT never stops trying to reconnect. I believe we may need some more robust connectivity-checking options.

As I said before, this machine, with the bad configuration, was working fine for months at a time, with ZT reconnecting when possible (even with the 20 minutes of network downtime, it was reconnecting afterwards).

Let me give it a few more days to see if the miscreant job is to blame; then we can figure out why ZT gives up on reconnecting even when network connectivity is restored.
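The daily window in the listing above can also be pulled out of the file names mechanically, which helps when there are more than two days of dumps. A minimal sketch, assuming the YYYYMMDDHHMM.log naming used above and input sorted by name (as `ls` produces); `offline_windows` is a made-up helper name.

```shell
#!/bin/bash
# Summarise, per day, the first and last minute for which an OFFLINE peer dump exists.
# Reads one entry per line on stdin; the file name must be the last field,
# so both plain `ls` and `ls -l` output work. Input is assumed sorted by name.
offline_windows() {
    awk '
        {
            name = $NF
            sub(/.*\//, "", name)                      # drop any directory part
            if (name !~ /^[0-9]+\.log$/ || length(name) != 16) next
            day = substr(name, 1, 8)
            hm  = substr(name, 9, 2) ":" substr(name, 11, 2)
            if (!(day in first)) { first[day] = hm; order[++n] = day }
            last[day] = hm
            count[day]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %s-%s (%d dumps)\n", order[i], first[order[i]], last[order[i]], count[order[i]]
        }
    '
}

# Usage, with the log directory as an assumption:
#   ls /var/log/zt-offline | offline_windows
```

A window that starts and ends at the same minute every day points at a scheduled job rather than at ZeroTier itself.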

I agree with @abclution: ZeroTier goes offline when it loses connection for whatever reason. When the connection is back, it stays offline until you stop and restart the service. In my case I need to stop it for about 10 min to make it work again.

thanks for working on this.

Hmm. My internet goes out a few times a day here, but this doesn't happen to me.
How many cores and how much RAM do you all have on your machines that experience this?
Where approximately are they located geographically? (I'm on a 16GB quad core mac in California)

@laduke, I have it running on a LattePanda Alpha SBC with 8GB of RAM and on a VM with 2GB of RAM. It happens on a laptop with 4GB of RAM too. All of the machines run Windows 10.

@laduke

The machine encountering the issue is a Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with 32 gigabytes of ram. (Hetzner Dedicated server)

Hey, so the problem has not happened again, so it was definitely caused by losing internet access for those 20 minutes. The service that was causing the internet outage was a firewall virtual machine that the bare-metal host gets its internet through.

Now how do we debug this further.. I can probably reliably reproduce this error, should I email you the peers logs from the time period after it disconnects and doesn't try to reconnect?

I think that this issue has nothing to do with the amounts of ram in the system...

We're able to reproduce this internally (sometimes) and are working on it. Not sure if we need to bother people to get peer lists.

I think that this issue has nothing to do with the amounts of ram in the system...

I think you are right.

Hey, can anyone who ends up in the OFFLINE state send their listpeers output, collected while offline?

What kind of router are your devices behind?

@laduke when OFFLINE, the listpeers output is like in this comment: https://github.com/zerotier/ZeroTierOne/issues/1214#issuecomment-640557587
Btw, the next time I'm OFFLINE I will update this comment with the result. I'm running a normal, ISP-branded router.

Sorry. With the -j flag; it has more info.

@laduke here's my listpeers -j output while offline:

[
 {
  "address": "xxxxxxxxxx",
  "latency": -1,
  "paths": [
   {
    "active": true,
    "address": "185.180.13.82/9993",
    "expired": false,
    "lastReceive": 1595101962420,
    "lastSend": 1595185181400,
    "preferred": false,
    "trustedPathId": 0
   }
  ],
  "role": "PLANET",
  "version": "-1.-1.-1",
  "versionMajor": -1,
  "versionMinor": -1,
  "versionRev": -1
 },
 {
  "address": "xxxxxxxxxx",
  "latency": -1,
  "paths": [
   {
    "active": true,
    "address": "50.7.252.138/9993",
    "expired": false,
    "lastReceive": 1595101962500,
    "lastSend": 1595185181400,
    "preferred": false,
    "trustedPathId": 0
   }
  ],
  "role": "PLANET",
  "version": "-1.-1.-1",
  "versionMajor": -1,
  "versionMinor": -1,
  "versionRev": -1
 },
 {
  "address": "xxxxxxxxxx",
  "latency": -1,
  "paths": [
   {
    "active": true,
    "address": "34.94.131.223/21008",
    "expired": false,
    "lastReceive": 1595185214840,
    "lastSend": 1595185214840,
    "preferred": true,
    "trustedPathId": 0
   }
  ],
  "role": "LEAF",
  "version": "1.4.6",
  "versionMajor": 1,
  "versionMinor": 4,
  "versionRev": 6
 },
 {
  "address": "xxxxxxxxxx",
  "latency": -1,
  "paths": [
   {
    "active": true,
    "address": "103.195.103.66/9993",
    "expired": false,
    "lastReceive": 1595101962312,
    "lastSend": 1595185181400,
    "preferred": false,
    "trustedPathId": 0
   }
  ],
  "role": "PLANET",
  "version": "-1.-1.-1",
  "versionMajor": -1,
  "versionMinor": -1,
  "versionRev": -1
 },
 {
  "address": "xxxxxxxxxx",
  "latency": -1,
  "paths": [
   {
    "active": true,
    "address": "195.181.173.159/9993",
    "expired": false,
    "lastReceive": 1595101962312,
    "lastSend": 1595185181400,
    "preferred": false,
    "trustedPathId": 0
   }
  ],
  "role": "PLANET",
  "version": "-1.-1.-1",
  "versionMajor": -1,
  "versionMinor": -1,
  "versionRev": -1
 },
 {
  "address": "xxxxxxxxxx",
  "latency": 7,
  "paths": [
   {
    "active": true,
    "address": "192.168.1.250/9993",
    "expired": false,
    "lastReceive": 1595185206356,
    "lastSend": 1595185206356,
    "preferred": true,
    "trustedPathId": 0
   }
  ],
  "role": "LEAF",
  "version": "1.4.6",
  "versionMajor": 1,
  "versionMinor": 4,
  "versionRev": 6
 }
]

Like this I can only reach the machine in the same LAN ("address": "192.168.1.250/9993").
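The useful detail in the dump above is the gap between lastSend and lastReceive on the PLANET paths: for the first planet entry, lastSend is about 23 hours later than lastReceive, so the node keeps sending but has heard nothing back for almost a day. The sketch below extracts that gap for every PLANET path. It is a rough awk pass that assumes the pretty-printed, one-key-per-line layout shown above (a real JSON parser would be more robust), and `planet_silence` is a made-up name.

```shell
#!/bin/bash
# For each PLANET peer in pretty-printed `zerotier-cli peers -j` output (stdin),
# print its path address and how many seconds lastSend is ahead of lastReceive.
# Assumes one key per line and that "paths" comes before "role", as in the dump above.
planet_silence() {
    awk '
        # Path addresses contain a slash (ip/port); the 10-digit peer address does not.
        $1 == "\"address\":" && $2 ~ /\// { addr = $2; gsub(/[",]/, "", addr) }
        $1 == "\"lastReceive\":"          { rx = $2; sub(/,/, "", rx) }
        $1 == "\"lastSend\":"             { tx = $2; sub(/,/, "", tx) }
        $1 == "\"role\":" {
            role = $2; gsub(/[",]/, "", role)
            if (role == "PLANET" && addr != "")
                printf "%s silent_for_s=%d\n", addr, (tx - rx) / 1000
            addr = ""; rx = 0; tx = 0      # reset for the next peer object
        }
    '
}

# Usage (on a machine with ZeroTier installed):
#   zerotier-cli peers -j | planet_silence
```

If a peer has several paths, this sketch reports only the last one listed.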

1186 might be related to this.

@unquietwiki in my case I get the OFFLINE status when the router loses the connection with the ISP: when the connection is re-established, ZeroTier always stays in OFFLINE status until I stop the service for at least 5 min and then start it again. In fact, in my script file I have set a sleep time of 10 min, just to be sure it works. Btw, are there any updates about this issue?

@graphixillusion we're still looking into it; thank you for your patience. There may be a couple different things going on; I know @laduke is looking into some Windows client bugs, and there may be another issue I'm looking into.

Is anyone experiencing this issue using Allow Global / Full Tunnel Mode ?

@laduke In my case the only flag checked is "Allow Managed IP". "Allow Global IP" and "Allow Default Route" are both disabled.

I agree with @abclution: ZeroTier goes offline when it loses connection for whatever reason. When the connection is back, it stays offline until you stop and restart the service. In my case I need to stop it for about 10 min to make it work again.

+1

root@v-master:~# zerotier-cli info
200 info 6c23152a38 1.4.6 OFFLINE

root@v-master:~# curl baidu.com
<html>
<meta http-equiv="refresh" content="0;url=http://www.baidu.com/">
</html>

root@v-master:~# systemctl restart zerotier-one.service

root@v-master:~# zerotier-cli info
200 info 6c23152a38 1.4.6 ONLINE

I have to restart manually to be online

Experiencing the same issue here. Any prospects for a fix?
