dvc get/import shows temporary paths

Created on 14 Oct 2019  ยท  5Comments  ยท  Source: iterative/dvc

$ dvc -V
0.62.1
$ dvc get https://github.com/iterative/example-get-started data/data.xml
Multi-Threaded:
 79%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰  |../../../../private/var/folders/hq/h82mgrhx37nc89_1rcg6gm30081024/37916850 [00:08<00:03,    2.25MB/s]

$ dvc get https://github.com/iterative/example-get-started model.pkl
Multi-Threaded:
 62%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–   |../../../../private/var/folders/hq/h82mgrhx37nc89_1rcg6gm283866624/6262877 [00:00<00:02 ...

Is there a reason to show temporary paths and what actions we can make on this? I'd expect to see just a file name like model.pkl?

c8-full-day p1-important ui

Most helpful comment

So the core culprit is stringifying PathInfo objects, which don't know about repo and so can't show sane paths, stringifying output objects is, however, ok. So:

  • it's ok to use str(self) where @efiop mentioned
  • OutputLocal.__str__() should be fixed though
  • the code Ruslan showed above is almost right, the issue is that a dvc file might refer outside repo dir with a rel path

I propose this logic for OutputLocal.__str__():

if not self.is_in_repo:
    return self.def_path
elif <curdir isin self.repo.root_dir>:
    return relpath(self.path_info, curdir)
else:
    return relpath(self.path_info, self.repo.root_dir)

Which means we show relative paths for the repo we are in and rel to repo root paths for all other repos. Remotes and absolute paths are always shown as is.

On <curdir isin ...> this not quire .startswith() as that is wrong for paths:

"a/b/c" isin "a/b" == True
"a/bc" isin "a/b" == False

This would fix the issue and we won't luckily need chdirs nor a global var.

P.S. As a second cleanup/refactor stage I would propose completely stop using PathInfo.__str__(), since it fundamentally doesn't know how to stringify itself. Raise TypeError() there and fix all the usages. This is definitely separate, don't think about it until the issue itself is resolved.

All 5 comments

For the record: we can either chdir() into the temporary repo root or to make a global variable that we can adjust that will make PathInfo.str() use it for relpath (Kudos @Suor).

This is not only about dvc get. Import are also affected, see #2691

An addition about global variable, which might be used as base for PathInfo.__str__(), it should be:

  • really a thread local, not a simple variable
  • not accessed nor changed outside dvc/path_info.py directly
  • managed with a context manager:
with path_info_base(some_dir):
    # ... path_infos are stringified differently here

This will protect us from usual global var issues making it a dynamic scope variable if you know what I mean ;)

From discord:


@kurianbenoy:
I am interested to fix that issue, even if it takes more time to understand the codebase

@efiop:
Glad to hear that! :slight_smile: Got it. Okay, let's think about some clever ways to fix that then.
ok, so we are using named cache for that. We could make it check if we are inside of a repo or not and if not then print a relpath computed relative to the repo.root_dir instead of cwd.
the piece of the code that is responsible for that is https://github.com/iterative/dvc/blob/0.66.4/dvc/output/base.py#L410

specifically str(self) that is passed as a name
so if we could do something like:

if os.getcwd().startswith(self.stage.repo.root_dir):
    name = str(self)
else:
    name = relpath(self.path_info, self.stage.repo.root_dir)

and then use that name instead of str(self) in that line, then it would automatically start working in an acceptable way.

@Suor: Some outs might be outside repo dir, but still "in repo", referencing them simply by name would be confusing

So the core culprit is stringifying PathInfo objects, which don't know about repo and so can't show sane paths, stringifying output objects is, however, ok. So:

  • it's ok to use str(self) where @efiop mentioned
  • OutputLocal.__str__() should be fixed though
  • the code Ruslan showed above is almost right, the issue is that a dvc file might refer outside repo dir with a rel path

I propose this logic for OutputLocal.__str__():

if not self.is_in_repo:
    return self.def_path
elif <curdir isin self.repo.root_dir>:
    return relpath(self.path_info, curdir)
else:
    return relpath(self.path_info, self.repo.root_dir)

Which means we show relative paths for the repo we are in and rel to repo root paths for all other repos. Remotes and absolute paths are always shown as is.

On <curdir isin ...> this not quire .startswith() as that is wrong for paths:

"a/b/c" isin "a/b" == True
"a/bc" isin "a/b" == False

This would fix the issue and we won't luckily need chdirs nor a global var.

P.S. As a second cleanup/refactor stage I would propose completely stop using PathInfo.__str__(), since it fundamentally doesn't know how to stringify itself. Raise TypeError() there and fix all the usages. This is definitely separate, don't think about it until the issue itself is resolved.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

TezRomacH picture TezRomacH  ยท  3Comments

shcheklein picture shcheklein  ยท  3Comments

anotherbugmaster picture anotherbugmaster  ยท  3Comments

mdscruggs picture mdscruggs  ยท  3Comments

jorgeorpinel picture jorgeorpinel  ยท  3Comments