DVC version: 0.62.1
Python version: 3.7.3
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: False
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')
I'm trying to import a directory versioned in our own dataset registry project into an empty, non-Git DVC project, but getting this cryptic error:
$ dvc import --rev 0547f58 \
[email protected]:iterative/dataset-registry.git \
use-cases/data
Importing 'use-cases/data ([email protected]:iterative/dataset-registry.git)' -> 'data'
ERROR: failed to import 'use-cases/data' from '[email protected]:iterative/dataset-registry.git'. - unable to find DVC-file with output '../../../../private/var/folders/_c/3mt_xn_d4xl2ddsx2m98h_r40000gn/T/tmphs83czecdvc-repo/use-cases/data'
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
The directory in question has file name b6923e1e4ad16ea1a7e2b328842d56a2.dir (see use-cases/cats-dogs.dvc of that version). The default remote is [configured](https://github.com/iterative/dataset-registry/blob/master/.dvc/config) to https://remote.dvc.org/dataset-registry (which is an HTTP redirect to the s3://dvc-public/remote/dataset-registry bucket). The file seems to be in the remote.
Am I just doing something wrong here (hopefully), or is dvc import broken?
p.s. I've also tried without --rev and get the same error (different output path).
@jorgeorpinel should it be use-cases/cats-dogs?
🤦‍♂️ Oops. I forgot I changed the directory name from data (the original name in the ZIP files used in the Versioning tutorial). But I still can't get it with the correct path:
$ dvc import --rev 0547f58 \
[email protected]:iterative/dataset-registry.git \
use-cases/cats-dogs
Importing 'use-cases/cats-dogs ([email protected]:iterative/dataset-registry.git)' -> 'cats-dogs'
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: ../../../../private/var/folders/_c/3mt_xn_d4xl2ddsx2m98h_r40000gn/T/tmpfnwm64lqdvc-repo/use-cases/cats-dogs, md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
Missing cache for directory '../../../../private/var/folders/_c/3mt_xn_d4xl2ddsx2m98h_r40000gn/T/tmpfnwm64lqdvc-repo/use-cases/cats-dogs'. Cache for files inside will be lost. Would you like to continue? Use '-f' to force. [y/n] y
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: ../../../../private/var/folders/_c/3mt_xn_d4xl2ddsx2m98h_r40000gn/T/tmpfnwm64lqdvc-repo/use-cases/cats-dogs, md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
WARNING: Cache 'b6923e1e4ad16ea1a7e2b328842d56a2.dir' not found. File 'cats-dogs' won't be created.
ERROR: failed to import 'use-cases/cats-dogs' from '[email protected]:iterative/dataset-registry.git'. - output 'cats-dogs' does not exist
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
There's also the additional issue of really long and cryptic messages, as well as a prompt I don't understand and just answer y to. (Both issues already reported in #2599.)
What does the "output 'cats-dogs' does not exist" error mean?
Hmmmm... Apparently I didn't push that version of the cats-dogs dir in the dataset registry project. It's not in the remote. I'll have to fix that first... I guess this issue is invalid then, fortunately! The messaging here is still pretty confusing though. Should I open another issue about this?
@jorgeorpinel yes, please open a new UI issue!
Done! #2602
The directory in question has file name b6923e1e4ad16ea1a7e2b328842d56a2.dir (see use-cases/cats-dogs.dvc of that version).
So, I pushed the data to the remote now and checked that it actually exists on S3:
$ aws s3 ls s3://dvc-public/remote/dataset-registry/b6/
2019-10-05 01:51:13 6388 2f5c18d1af468fd41c979873a8404b
2019-10-05 01:51:41 22202 4ced1e881cc37c0e0673bafe6e789c
2019-10-12 19:10:03 161184 923e1e4ad16ea1a7e2b328842d56a2.dir <-- Bingo
2019-10-05 01:50:56 17450 efd10ab38ff17fa593e3b102d088ac
However, when I try to import it (into the same empty non-Git DVC project), the progress bar runs for a while up to around 90%, then suddenly disappears, and I get:
$ dvc import --rev 0547f58 \
[email protected]:iterative/dataset-registry.git \
use-cases/cats-dogs
Importing 'use-cases/cats-dogs ([email protected]:iterative/dataset-registry.git)' -> 'cats-dogs'
ERROR: failed to import 'use-cases/cats-dogs' from '[email protected]:iterative/dataset-registry.git'. - could not perform a HEAD request
And nothing is downloaded. I've tried several times. My Internet connection is fine:
SpeedTest result: https://www.speedtest.net/result/8671087724
Is there a single file missing or something? How do I find it? I've tried dvc push from the source project again and it states Everything is up to date. (Investigated in the following comment: https://github.com/iterative/dvc/issues/2600#issuecomment-541440512)
p.s. Here's the last part of the -v output of the same command: https://pastebin.com/9tPWivJr (includes the full Python exception traceback).
That one run failed at file adb29c1de1624c53c808f1a15bd332ba, but it's there:
$ aws s3 ls s3://dvc-public/remote/dataset-registry/ad/b29c1de1624c53c808f1a15bd332ba
2019-10-05 01:51:44 22427 b29c1de1624c53c808f1a15bd332ba
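The two `aws s3 ls` checks above suggest how a checksum maps to an object path in the remote: the first two characters become a directory and the rest becomes the file name. A minimal sketch of that mapping, inferred only from the listings in this thread (the helper names `cache_key` and `cache_url` are hypothetical, not DVC API):

```python
def cache_key(checksum):
    """Split a checksum into '<first two chars>/<remaining chars>'.

    Layout inferred from the `aws s3 ls` output in this thread;
    this is an illustration, not DVC's own implementation.
    """
    return "{}/{}".format(checksum[:2], checksum[2:])


def cache_url(remote_url, checksum):
    """Join a remote base URL and the cache key for a checksum."""
    return "{}/{}".format(remote_url.rstrip("/"), cache_key(checksum))


print(cache_url("s3://dvc-public/remote/dataset-registry",
                "adb29c1de1624c53c808f1a15bd332ba"))
```

The printed path can be passed to `aws s3 ls` to check whether a specific file from a failed run is present in the bucket.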
@iterative/engineering p0 since it's a blocker and a potential bug.
Can reproduce on my mac, but not on linux
ERROR: failed to import 'use-cases/cats-dogs' from '[email protected]:iterative/dataset-registry.git'. - could not perform a HEAD request
------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 57, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 748, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 301, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x116945310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='remote.dvc.org', port=443): Max retries exceeded with url: /dataset-registry/61/5bb7cebf1779b530f33b100d1f14b5 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x116945310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dvc/remote/http.py", line 87, in _request
return requests.request(method, url, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='remote.dvc.org', port=443): Max retries exceeded with url: /dataset-registry/61/5bb7cebf1779b530f33b100d1f14b5 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x116945310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dvc/command/imp.py", line 20, in run
rev=self.args.rev,
File "/usr/local/lib/python3.7/site-packages/dvc/repo/imp.py", line 6, in imp
return self.imp_url(path, out=out, erepo=erepo, locked=True)
File "/usr/local/lib/python3.7/site-packages/dvc/repo/__init__.py", line 33, in wrapper
ret = f(repo, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
result = method(repo, *args, **kw)
File "/usr/local/lib/python3.7/site-packages/dvc/repo/imp_url.py", line 25, in imp_url
stage.run()
File "/usr/local/lib/python3.7/site-packages/dvc/stage.py", line 861, in run
self.deps[0].download(self.outs[0])
File "/usr/local/lib/python3.7/site-packages/dvc/dependency/repo.py", line 77, in download
out = self.fetch()
File "/usr/local/lib/python3.7/site-packages/dvc/dependency/repo.py", line 72, in fetch
repo.cloud.pull(out.get_used_cache())
File "/usr/local/lib/python3.7/site-packages/dvc/data_cloud.py", line 81, in pull
show_checksums=show_checksums,
File "/usr/local/lib/python3.7/site-packages/dvc/remote/local/__init__.py", line 412, in pull
download=True,
File "/usr/local/lib/python3.7/site-packages/dvc/remote/local/__init__.py", line 376, in _process
download=download,
File "/usr/local/lib/python3.7/site-packages/dvc/remote/local/__init__.py", line 301, in status
md5s, jobs=jobs, name=str(remote.path_info)
File "/usr/local/lib/python3.7/site-packages/dvc/remote/base.py", line 738, in cache_exists
ret = list(itertools.compress(checksums, in_remote))
File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
yield fs.pop().result()
File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.7/site-packages/dvc/remote/base.py", line 731, in exists_with_progress
ret = self.exists(path_info)
File "/usr/local/lib/python3.7/site-packages/dvc/remote/http.py", line 50, in exists
return bool(self._request("HEAD", path_info.url))
File "/usr/local/lib/python3.7/site-packages/dvc/remote/http.py", line 89, in _request
raise DvcException("could not perform a {} request".format(method))
dvc.exceptions.DvcException: could not perform a HEAD request
------------------------------------------------------------
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
Reproduction script for Linux:
#!/bin/bash
rm -rf repo
mkdir repo
cd repo
dvc init --no-scm
dvc import --rev 0547f58 \
[email protected]:iterative/dataset-registry.git \
use-cases/cats-dogs
The maximum number of connections here needs to be raised to a large value. For me 10k worked.
It seems like we are hitting some limit here.
Related: #2473
It seems that the problem is that every request we send reserves a socket through the requests API, and each socket takes up an open file descriptor slot. In this particular case, the method RemoteLOCAL.cache_exists issues a lot of HEAD calls in parallel, which exceeds the open file descriptor limit.
example:
ulimit -n 16
and run:
from concurrent.futures import ThreadPoolExecutor

from requests import head


def run_session(i):
    # Each bare head() call creates its own Session, and so its own socket.
    try:
        head("https://www.google.com")
    except Exception as e:
        print(e)


# More workers (24) than allowed file descriptors (16), so some calls are expected to fail.
with ThreadPoolExecutor(max_workers=24) as executor:
    executor.map(run_session, range(24))
print("finished")
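To see which limit the example above runs into, the current open file descriptor limits can also be read from inside Python. This is a small sketch using the standard library `resource` module, which is only available on Unix-like systems; the helper name `fd_limits` is made up for illustration:

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = fd_limits()
# The soft limit is the value that `ulimit -n` reports and changes.
print("soft limit:", soft, "hard limit:", hard)
```

Comparing the soft limit with the number of parallel HEAD calls gives a rough idea of whether the limit is likely to be exceeded.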
Related: #2473 ('Errno 24 - Too many open files' on dvc push)
I didn't have any problem pushing this whole directory (1800 images) from the source project though. I'm guessing dvc pull will probably also work fine, let me check...
dvc pull also works just fine (from the source project, after deleting the pushed directory). What makes import different?
p.s. I also just tried dvc get and the same problem occurs. What makes these different from fetch/pull?
Little summary so far:
The failure happens in remote/base.cache_exists (see the traceback above).
fetch/pull does not have the same problems as import/get.
Possible way of handling the problem:
The problem might be triggered because a requests.sessions.Session object is created on each requests.request call. Maybe we could solve that by creating our own Session object, mounting the proper HTTPAdapters, and reusing this session instead of calling requests.request each time.
Can reproduce this same bug on windows too :(
For the record, this only breaks with binary installs. pip works fine. If you are experiencing this, try uninstalling the binary package and installing from pip or conda.
EDIT: wrong issue, it was meant for https://github.com/iterative/dvc/issues/2589
You mean on Windows? My install on Mac is from pip3.
@jorgeorpinel oops, sorry, wrong issue.
https://requests.kennethreitz.org/en/master/user/advanced/ says that a session uses a connection pool by default. Changing to use a session instead of calling requests.request directly made everything work for me, and I no longer see fluctuations in fd numbers. Will send a patch ASAP. Kudos @pared :tada:
I can confirm it's fixed for me as well in DVC 0.63.4. Thanks!!!