Electrum: socket.getaddrinfo takes a mutex on Windows

Created on 9 Jun 2018 · 6 Comments · Source: spesmilo/electrum

socket.getaddrinfo, which gets called nearly every time we need a DNS resolution, takes a lock in CPython on Windows, so concurrent lookups have to wait for each other. A single DNS resolution might take 10 seconds to fail!
On Linux, there is no lock.
On Mac, there used to be a lock, but not since CPython 3.5.2.


Apart from the odd DNS failure, e.g. when an electrum server's domain becomes unavailable, the main problem is trying to resolve .onion addresses when the client has no way to do so.
E.g. if a user enables a Tor proxy, the recent_servers file might get populated with .onion servers. If the user then disables the Tor proxy, .onion resolution will fail, it will take 10 seconds to fail, and all other resolutions will wait for it.
The default server list also contains some .onion servers, so just by random chance a client might try to connect to one, and socket.getaddrinfo then holds the lock for 10 seconds.


  • connecting to ~10 electrum servers; we call socket.getaddrinfo directly for each:
    https://github.com/spesmilo/electrum/blob/7043d6907f1af3dbb67f04d77147079e47b3dae5/lib/interface.py#L103

  • connecting to a lightning peer via asyncio.open_connection, which calls socket.getaddrinfo too:
    https://github.com/spesmilo/electrum/blob/bc007a3672c6a0c05a0f5e7b8eed217166b8c408/lib/lnbase.py#L763

  • fetching exchange rates via requests.request, which calls socket.getaddrinfo too:
    https://github.com/spesmilo/electrum/blob/7043d6907f1af3dbb67f04d77147079e47b3dae5/lib/exchange_rate.py#L38


    trace (taken by temporarily making socket.getaddrinfo raise an exception):

    Traceback (most recent call last):
      File "...\electrum\lib\exchange_rate.py", line 53, in update_safe
        self.quotes = self.get_rates(ccy)
      File "...\electrum\lib\exchange_rate.py", line 252, in get_rates
        '/v1/bpi/currentprice/%s.json' % ccy)
      File "...\electrum\lib\exchange_rate.py", line 38, in get_json
        response = requests.request('GET', url, headers={'User-Agent' : 'Electrum'}, timeout=10)
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
        return session.request(method=method, url=url, **kwargs)
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
        resp = self.send(prep, **send_kwargs)
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
        r = adapter.send(request, **kwargs)
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
        timeout=timeout
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
        chunked=chunked)
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 346, in _make_request
        self._validate_conn(conn)
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 850, in _validate_conn
        conn.connect()
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 284, in connect
        conn = self._new_conn()
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 141, in _new_conn
        (self.host, self.port), self.timeout, **extra_kw)
      File "...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection
        for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
      File "...\AppData\Local\Programs\Python\Python36-32\lib\socket.py", line 746, in getaddrinfo
        raise Exception("testing")
    

  • connecting to the trustedcoin server, which also uses requests.request:
    https://github.com/spesmilo/electrum/blob/7043d6907f1af3dbb67f04d77147079e47b3dae5/plugins/trustedcoin/trustedcoin.py#L109

  • Also see https://github.com/spesmilo/electrum/commit/680df7d6b60ffcf66f6c47eb73697da1a8613405#commitcomment-28185450 -- the Trezor Bridge is affected too, as it uses requests.post, which uses socket.getaddrinfo


So, suppose you are trying to connect to a lightning peer right after startup, but before that happens the client tries to connect to a few electrum servers. If, say, three of those servers are .onion, the connection to the LN peer will wait on the lock for 30 seconds.
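The waiting described above can be reproduced with a small simulation. This is only a sketch and not CPython's actual code: `RESOLVER_LOCK`, `fake_getaddrinfo` and `time_fast_lookup_behind` are made-up names, and the delays are scaled down from 10 seconds to fractions of a second.

```python
import threading
import time

# Hypothetical stand-in for the process-wide lock taken around getaddrinfo.
RESOLVER_LOCK = threading.Lock()


def fake_getaddrinfo(delay):
    """Pretend lookup: hold the shared lock for `delay` seconds."""
    with RESOLVER_LOCK:
        time.sleep(delay)


def time_fast_lookup_behind(slow_count, slow_delay=0.1, fast_delay=0.01):
    """Start `slow_count` slow lookups, then time one fast lookup."""
    slow = [threading.Thread(target=fake_getaddrinfo, args=(slow_delay,))
            for _ in range(slow_count)]
    for t in slow:
        t.start()
    time.sleep(0.02)  # let the slow lookups reach the lock first
    start = time.perf_counter()
    fake_getaddrinfo(fast_delay)  # e.g. the lightning peer's lookup
    elapsed = time.perf_counter() - start
    for t in slow:
        t.join()
    return elapsed
```

With `slow_count=0` the fast lookup returns almost immediately. With `slow_count=3` it has to wait for at least one slow lookup to release the lock, and in the worst case for all three, which corresponds to the 30 seconds in the scenario above.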


Note that since Python 3.5.2, asyncio.open_connection does not call socket.getaddrinfo if we pass an IP address (not a domain name).


Related:
https://emptysqua.re/blog/getaddrinfo-on-macosx/
https://stackoverflow.com/a/1212821/7499128

OS-windows topic-network

All 6 comments

"code snippet 1"

So e.g. this is clearly sequential on Windows (by which I mean I have tested on real Windows running on bare hardware and timed it against wall clock):

import socket
import threading

thread_list = []
for i in range(100):
    # each thread tries to resolve a different, unresolvable .onion name
    t = threading.Thread(target=socket.getaddrinfo, args=("aa000000aa"+str(i)+".onion.", 80))
    thread_list.append(t)

for t in thread_list:
    t.start()
for t in thread_list:
    t.join()

"code snippet 2"

However this for example is parallel:

import dns.resolver
import threading

thread_list = []
for i in range(100):
    # same names as above, but resolved through dnspython instead of getaddrinfo
    t = threading.Thread(target=dns.resolver.query, args=("aa000000aa"+str(i)+".onion.",))
    thread_list.append(t)

for t in thread_list:
    t.start()
for t in thread_list:
    t.join()

using https://github.com/rthalley/dnspython -- which we already have as a dependency!

So, I am very much considering something like this:

diff --git a/lib/network.py b/lib/network.py
index d6349711f..1ee43064f 100644
--- a/lib/network.py
+++ b/lib/network.py
@@ -32,8 +32,11 @@ from collections import defaultdict
 import threading
 import socket
 import json
+import sys

+import dns
 import socks
+
 from . import util
 from . import bitcoin
 from .bitcoin import *
@@ -413,7 +416,18 @@ class Network(util.DaemonThread):
             socket.getaddrinfo = lambda *args: [(socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1]))]
         else:
             socket.socket = socket._socketobject
-            socket.getaddrinfo = socket._getaddrinfo
+            if sys.platform == 'win32':
+                def cheating_getaddrinfo(host, *args, **kwargs):
+                    try:
+                        answers = dns.resolver.query(host)
+                        addr = str(answers[0])
+                    except:
+                        raise socket.gaierror(11001, 'getaddrinfo failed')
+                    else:
+                        return socket._getaddrinfo(addr, *args, **kwargs)
+                socket.getaddrinfo = cheating_getaddrinfo
+            else:
+                socket.getaddrinfo = socket._getaddrinfo

     def start_network(self, protocol, proxy):
         assert not self.interface and not self.interfaces
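The idea in the patch above can also be sketched as a standalone wrapper that is testable without a network. This is an illustration under assumptions, not the code that went into Electrum: `make_getaddrinfo`, `is_ip_literal` and the injected `resolve_host` callable are hypothetical names. Unlike the patch, the sketch passes IP literals and non-string hosts straight through, since sending those to a DNS resolver would make them fail.

```python
import ipaddress
import socket


def is_ip_literal(host):
    """Return True if `host` is already an IPv4/IPv6 address string."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def make_getaddrinfo(resolve_host, real_getaddrinfo=socket.getaddrinfo):
    """Build a getaddrinfo replacement that resolves names via `resolve_host`.

    `resolve_host(name)` is assumed to return one IP address as a string
    and to raise on failure.
    """
    def patched_getaddrinfo(host, *args, **kwargs):
        # Nothing to resolve for non-string hosts (e.g. None) or IP literals.
        if not isinstance(host, str) or is_ip_literal(host):
            return real_getaddrinfo(host, *args, **kwargs)
        try:
            addr = resolve_host(host)
        except Exception as exc:
            # Same error as in the patch; 11001 is WSAHOST_NOT_FOUND.
            raise socket.gaierror(11001, 'getaddrinfo failed') from exc
        # Only an IP literal reaches the real getaddrinfo.
        return real_getaddrinfo(addr, *args, **kwargs)
    return patched_getaddrinfo
```

With dnspython, `resolve_host` could be something like `lambda host: str(dns.resolver.query(host)[0])` after an explicit `import dns.resolver`. That default query only returns A records, so this sketch would be IPv4-only.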

@EagleTM found the following:
https://superuser.com/questions/969171/multihomed-windows-10-dns-resolution-timeouts

Microsoft has in Windows 10 substantially modified or rewritten the DNS Resolver.
The biggest change was to issue DNS queries to all adapters in parallel, then take the first answer to arrive. Unfortunately the new code contains bugs and omissions, and it seems that rather than take the first answer, it waits for all answers. If one of the DNS queries will time-out, this means a 10-seconds wait before the DNS is resolved.


I have now re-tested "code snippet 1" using the official x86 CPython 3.7.2 binaries.

(read as "On my test config...")

  1. On a Windows7 [Version 6.1.7601] VM, with a single network interface, the script finishes so fast that it's not possible to tell just by looking at wall clock time whether it's sequential or parallel.
  2. On a Windows7 [Version 6.1.7601] VM, with two network interfaces, the script finishes so fast that it's not possible to tell just by looking at wall clock time whether it's sequential or parallel.
  3. On a Windows10 [Version 10.0.16299.248] VM, with a single network interface, the script finishes so fast that it's not possible to tell just by looking at wall clock time whether it's sequential or parallel.
  4. On a Windows10 [Version 10.0.16299.248] VM, with two network interfaces, the domains are getting resolved sequentially, each taking 10 seconds to fail.
  5. On a Windows10 [Version 10.0.16299.904] real machine, with two network interfaces, the domains are getting resolved sequentially, each taking 10 seconds to fail.

So from these experiments, I would like to conclude that
(1) my "second" interface takes 10 seconds to fail the resolution; Win7 does not wait for it and returns the result from the other interface, while Win10 waits for it;
(2) resolution on Win10 is probably sequential regardless of the number of interfaces, as the number of interfaces should not affect that; with a single interface each lookup just fails too quickly to notice.

Note that it is not possible to conclude from these tests whether resolution on Win7 is sequential or not.
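One way to get a number instead of eyeballing wall clock time is to compare a plain loop against the threaded run. The following is a rough sketch; `quiet`, `time_sequential`, `time_threaded` and `speedup` are hypothetical helper names, and for very fast lookups thread start-up overhead will blur the result.

```python
import threading
import time


def quiet(resolve):
    """Wrap `resolve` so that failed lookups (OSError, which includes
    socket.gaierror) do not abort the measurement."""
    def wrapper(host):
        try:
            resolve(host)
        except OSError:
            pass
    return wrapper


def time_sequential(resolve, hosts):
    """Wall clock time to resolve `hosts` one after another."""
    start = time.perf_counter()
    for host in hosts:
        resolve(host)
    return time.perf_counter() - start


def time_threaded(resolve, hosts):
    """Wall clock time to resolve `hosts` with one thread per host."""
    threads = [threading.Thread(target=resolve, args=(h,)) for h in hosts]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def speedup(resolve, seq_hosts, thr_hosts):
    """Ratio near 1: threads did not help (lookups look serialized).
    Ratio well above 1: lookups ran in parallel."""
    return time_sequential(resolve, seq_hosts) / time_threaded(resolve, thr_hosts)
```

For a real measurement one could pass `quiet(lambda h: socket.getaddrinfo(h, 80))` as `resolve`. The two host lists should contain different names, on the assumption that the OS may cache failed lookups between the two runs.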


@JustinTArthur @kyuupichan

@SomberNight thanks a lot

Yes, thanks @EagleTM and @SomberNight; I hate mysteries! While I'm glad it's not a CPython bug, that would have been easier to fix than a Windows bug.

@JustinTArthur well, there are probably two separate issues here:

  • the one described on superuser, which causes a single request to use up the full 10 second timeout, and which looks to be a Windows bug
  • the DNS resolutions being sequential, which might still be in CPython