Dvc.org: user-guide: initial get-started/data.xml pull doesn't work

Created on 16 Mar 2020 · 15 Comments · Source: iterative/dvc.org

When I run the first dvc get listed here, I get the following error:

paul ~/GitHub/dvc » dvc get https://github.com/iterative/dataset-registry \
>           get-started/data.xml -o data/data.xml
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files: 
name: ../../home/ubuntu/GitHub/dvc/data/data.xml.jCpvxLLAJHBNeBGwZSWmUp.tmp, md5: a304afb96060aad90176268345e10355
WARNING: Cache 'a304afb96060aad90176268345e10355' not found. File 'data/data.xml.jCpvxLLAJHBNeBGwZSWmUp.tmp' won't be created.
ERROR: failed to get 'get-started/data.xml' from 'https://github.com/iterative/dataset-registry' - The path 'get-started/data.xml' does not exist in the target repository 'https://github.com/iterative/dataset-registry' neither as an output nor a git-handled file.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
awaiting-response triage

All 15 comments

PR#1057 isn't the solution. Documentation needs to be updated to reflect steps at iterative/dataset-registry.

@paulkaefer I can't reproduce this. Could you run it with -v and share the log, please?

@shcheklein:

paul ~/GitHub/test » dvc get -v https://github.com/iterative/dataset-registry \ get-started/data.xml -o data/data.xml
2020-03-16 11:51:23,701 DEBUG: Creating external repo https://github.com/iterative/dataset-registry@None
2020-03-16 11:51:23,873 DEBUG: erepo: git clone https://github.com/iterative/dataset-registry to a temporary dir
2020-03-16 11:51:24,267 DEBUG: erepo: making a copy of https://github.com/iterative/dataset-registry clone
2020-03-16 11:51:24,380 DEBUG: Removing '/home/ubuntu/GitHub/test/data/.3yXxuEYVW5sfU8QDEk5GC4'
2020-03-16 11:51:24,380 ERROR: failed to get ' get-started/data.xml' from 'https://github.com/iterative/dataset-registry' - The path ' get-started/data.xml' does not exist in the target repository 'https://github.com/iterative/dataset-registry' neither as an output nor a git-handled file.
------------------------------------------------------------
Traceback (most recent call last):
  File "/snap/dvc/241/lib/python3.6/site-packages/dvc/external_repo.py", line 94, in pull_to
    fs_copy(fspath(path_info), fspath(to_info))
  File "/snap/dvc/241/lib/python3.6/site-packages/dvc/utils/fs.py", line 27, in fs_copy
    shutil.copy2(src, dst)
  File "/snap/dvc/241/usr/lib/python3.6/shutil.py", line 263, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/snap/dvc/241/usr/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpq0m5c03gdvc-erepo/ get-started/data.xml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/snap/dvc/241/lib/python3.6/site-packages/dvc/command/get.py", line 41, in _get_file_from_repo
    rev=self.args.rev,
  File "/snap/dvc/241/lib/python3.6/site-packages/dvc/repo/get.py", line 55, in get
    repo.pull_to(path, PathInfo(out))
  File "/snap/dvc/241/lib/python3.6/site-packages/dvc/external_repo.py", line 96, in pull_to
    raise PathMissingError(path, self.url)
dvc.exceptions.PathMissingError: The path ' get-started/data.xml' does not exist in the target repository 'https://github.com/iterative/dataset-registry' neither as an output nor a git-handled file.
------------------------------------------------------------

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

To be clear, I'm going through the tutorial. I was able to (1) clone the dataset-registry repo, and (2) run cp ../dataset-registry/get-started/data.xml data/ to get the data.xml file into my getting started local repo.

Ok, I see where that problem comes from

The path ' get-started/data.xml'

mind the space before the path.

Looks like it's copy-paste + terminal problem.

This is exactly how regular CLI tools behave:

touch " test"
rm -f \ test
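The escaping behaviour described above can be reproduced without DVC. The following is a minimal sketch; `show_args` is a hypothetical helper (not part of DVC or the docs) that prints each argument in brackets so a leading space becomes visible:

```shell
# Hypothetical helper: print every argument on its own line, wrapped in brackets.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Backslash followed by a space: the space is escaped and becomes part of the argument.
show_args \ get-started/data.xml      # prints "[ get-started/data.xml]"

# Backslash immediately followed by a newline: a line continuation, no stray space.
show_args \
  get-started/data.xml                # prints "[get-started/data.xml]"

# Backslash directly before an ordinary character: just that character.
show_args \get-started/data.xml       # prints "[get-started/data.xml]"
```

The first call matches the path ' get-started/data.xml' reported in the error above: the shell hands the tool an argument that starts with a space.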

What I would suggest is changing the command somehow to avoid these copy-paste problems.

@paulkaefer what OS, what browser and terminal do you use?

@shcheklein Ubuntu, default Terminal (bash). Brave Browser.

Command was copied from the top example @ https://dvc.org/doc/get-started/add-files.

@shcheklein good call, though. I removed the space after the \ and dvc get -v https://github.com/iterative/dataset-registry \get-started/data.xml -o data/data.xml works.

@paulkaefer so, when I copy it and paste I'm getting something like this in my browser:

(.env) [ivan@ivan /tmp]$ dvc get https://github.com/iterative/dataset-registry \
>           get-started/data.xml -o data/data.xml

is it the same for you?

@shcheklein yes!

@paulkaefer and when you run w/o modifications, does it work? (it works for me, in my terminal as-is if I just copy-paste it)

@shcheklein yes. Did you change something? Maybe I mis-copied before? I believe the first time, I typed it out in the interest of developing muscle memory for dvc.

@paulkaefer No, I didn't change anything. Looks like some mis-copy or some honest typo :) Closing this for now. Thanks for reporting this though, if we get more complaints we'll think about simplifying some command to fit into a single line.

Thanks, @shcheklein. I've shared the tutorial internally, with my recommendation:

this is how tech tutorials _should_ be (easy to follow, colorful, expand boxes for concepts you might or might not know).

I'll be sure to open issues or PRs if I find anything else.

Hi, I'm also having a problem with this command that doesn't seem to be related to the space issue.

I've tried copying and pasting directly from the docs:

dvc get -v https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml

and putting it all on one line:

dvc get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

In both cases I get:

$ dvc get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

2020-07-02 11:36:24,693 DEBUG: Creating external repo https://github.com/iterative/dataset-registry@None
2020-07-02 11:36:24,693 DEBUG: erepo: git clone https://github.com/iterative/dataset-registry to a temporary dir
2020-07-02 11:36:27,278 DEBUG: Saving '../../../../../tmp/tmpn23yad7vdvc-clone/get-started/data.xml' to 'data/.HsELAmwFokBPr9emR7s3sd/a3/04afb96060aad90176268345e10355'.
2020-07-02 11:36:27,279 DEBUG: cache '/home/matthew/Documents/muanalytics/dvc/data/.HsELAmwFokBPr9emR7s3sd/a3/04afb96060aad90176268345e10355' expected 'a304afb96060aad90176268345e10355' actual 'None'
2020-07-02 11:36:27,279 DEBUG: cache '/home/matthew/Documents/muanalytics/dvc/data/.HsELAmwFokBPr9emR7s3sd/a3/04afb96060aad90176268345e10355' expected 'a304afb96060aad90176268345e10355' actual 'None'
2020-07-02 11:36:27,303 DEBUG: Preparing to download data from 'https://remote.dvc.org/dataset-registry'
2020-07-02 11:36:27,303 DEBUG: Preparing to collect status from https://remote.dvc.org/dataset-registry
2020-07-02 11:36:27,304 DEBUG: Collecting information from local cache...
2020-07-02 11:36:27,306 DEBUG: cache '/home/matthew/Documents/muanalytics/dvc/data/.HsELAmwFokBPr9emR7s3sd/a3/04afb96060aad90176268345e10355' expected 'a304afb96060aad90176268345e10355' actual 'None'
2020-07-02 11:36:27,308 DEBUG: Collecting information from remote cache...
2020-07-02 11:36:27,309 DEBUG: Matched '0' indexed hashes
2020-07-02 11:36:27,309 DEBUG: Querying 1 hashes via object_exists
2020-07-02 11:36:31,232 DEBUG: Removing '/home/matthew/Documents/muanalytics/dvc/data/.HsELAmwFokBPr9emR7s3sd'
2020-07-02 11:36:31,232 ERROR: failed to get 'get-started/data.xml' from 'https://github.com/iterative/dataset-registry' - could not perform a HEAD request
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 976, in _validate_conn
    conn.connect()
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connection.py", line 308, in connect
    conn = self._new_conn()
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connection.py", line 172, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f99a1d22b38>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 765, in urlopen
    **response_kw
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 765, in urlopen
    **response_kw
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 765, in urlopen
    **response_kw
  [Previous line repeated 2 more times]
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 725, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3-us-east-2.amazonaws.com', port=443): Max retries exceeded with url: /dvc-public/remote/dataset-registry/a3/04afb96060aad90176268345e10355 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f99a1d22b38>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/http.py", line 104, in request
    **kwargs,
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/requests/sessions.py", line 665, in send
    history = [resp for resp in gen]
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/requests/sessions.py", line 665, in <listcomp>
    history = [resp for resp in gen]
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/requests/sessions.py", line 245, in resolve_redirects
    **adapter_kwargs
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='s3-us-east-2.amazonaws.com', port=443): Max retries exceeded with url: /dvc-public/remote/dataset-registry/a3/04afb96060aad90176268345e10355 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f99a1d22b38>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/command/get.py", line 41, in _get_file_from_repo
    rev=self.args.rev,
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/repo/get.py", line 53, in get
    repo.get_external(path, out)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/external_repo.py", line 143, in get_external
    _, _, save_infos = self.fetch_external([path])
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/external_repo.py", line 133, in fetch_external
    download_callback=download_update,
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/base.py", line 1161, in save
    return self._save(path_info, tree, hash_, save_link, **kwargs)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/base.py", line 1169, in _save
    return self._save_file(path_info, tree, hash_, save_link, **kwargs)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/base.py", line 1096, in _save_file
    with tree.open(path_info, mode="rb") as fobj:
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/repo/tree.py", line 274, in open
    path, mode=mode, encoding=encoding, **kwargs
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/repo/tree.py", line 94, in open
    self.repo.cloud.pull(cache_info, remote=remote)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/data_cloud.py", line 85, in pull
    cache, jobs=jobs, remote=remote, show_checksums=show_checksums
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/base.py", line 79, in wrapper
    return f(obj, named_cache, remote, *args, **kwargs)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/local.py", line 710, in pull
    download=True,
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/local.py", line 610, in _process
    download=download,
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/local.py", line 469, in _status
    md5s, jobs=jobs, name=str(remote.path_info)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/base.py", line 812, in hashes_exist
    remote_hashes = self.tree.list_hashes_exists(hashes, jobs, name)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/base.py", line 701, in list_hashes_exists
    ret = list(itertools.compress(hashes, in_remote))
  File "/home/matthew/.pyenv/versions/3.7.2/lib/python3.7/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/home/matthew/.pyenv/versions/3.7.2/lib/python3.7/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/home/matthew/.pyenv/versions/3.7.2/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/matthew/.pyenv/versions/3.7.2/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/base.py", line 694, in exists_with_progress
    ret = self.exists(path_info)
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/http.py", line 125, in exists
    return bool(self.request("HEAD", path_info.url))
  File "/home/matthew/Documents/muanalytics/dvc/build/virtualenv/lib/python3.7/site-packages/dvc/remote/http.py", line 122, in request
    raise DvcException(f"could not perform a {method} request")
dvc.exceptions.DvcException: could not perform a HEAD request
------------------------------------------------------------

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

I'm on Ubuntu 18.04.4 LTS with zsh 5.4.2 (x86_64-ubuntu-linux-gnu). I installed with:

pip install dvc
pip install dvc[s3]

@ivyleavedtoadflax how about this:

wget https://s3-us-east-2.amazonaws.com/dvc-public/remote/dataset-registry/a3/04afb96060aad90176268345e10355

does it work for you?
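The traceback above fails on a HEAD request, so a similar check can be made with curl instead of downloading the file. This is only a sketch: `can_reach` is a hypothetical helper, and the 10-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: report whether a URL answers a HEAD request.
# --head sends HEAD, --location follows redirects, --fail turns HTTP errors
# (status >= 400) into a non-zero exit code, --max-time bounds the wait.
can_reach() {
  url=$1
  if curl --silent --head --location --fail --max-time 10 \
          --output /dev/null "$url"; then
    echo "reachable: $url"
  else
    echo "unreachable: $url"
    return 1
  fi
}
```

Calling `can_reach` with the S3 URL from the wget command above should print "unreachable" while a VPN, proxy, or firewall is refusing the connection, which separates a network problem from a DVC problem.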

Ah :facepalm: sorry @shcheklein I needed to disconnect my VPN. It's all working now :+1:
