DVC: get/import: look for output in all remotes (not just default remote, which may not exist)?

Created on 20 Feb 2020 · 4 Comments · Source: iterative/dvc

Extracted from https://github.com/iterative/dvc/issues/3369#issuecomment-589248972

This would be useful for dataset registries where each dataset is a directory that uses a different DVC remote. No default remote is set in this kind of project to help prevent people from accidentally pushing a dataset to the wrong remote (e.g. different S3 bucket keys). In a project like that, get and import won't work because they expect a default remote.
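To make the scenario concrete, here is a minimal sketch of what such a registry's config might look like and how one could list its remotes. The remote names and bucket URLs are made up, and the section-header layout is assumed to follow DVC's usual ['remote "name"'] convention; this is an illustration, not how DVC itself reads its config.

```python
import configparser
import re

# Hypothetical .dvc/config of a data registry: two remotes and no
# [core] section, so no default remote is set.
CONFIG_TEXT = """
['remote "images"']
url = s3://example-bucket-a/images
['remote "audio"']
url = s3://example-bucket-b/audio
"""


def list_remotes(config_text):
    """Return ({remote_name: url}, default_remote_or_None)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    remotes = {}
    for section in parser.sections():
        # Section names look like: 'remote "images"' (quotes included).
        match = re.fullmatch(r"'?remote \"(.+)\"'?", section)
        if match:
            remotes[match.group(1)] = parser[section].get("url")
    # fallback=None covers both a missing [core] section and a missing key.
    default = parser.get("core", "remote", fallback=None)
    return remotes, default


remotes, default = list_remotes(CONFIG_TEXT)
print(remotes, default)
```

With no default remote in the source project, get/import has nothing to pick, which is the failure described above.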

Possible implementation solutions

  • Add --remote option to get/import: not great because it requires manually inspecting the source project config file before being able to get/import.
  • get/import can try every remote in the source project config file sequentially and use the first one that contains the target data to download
  • New behavior in dvc remote default to be able to set default remotes for specific paths in a project (i.e. this would be done in the source data registry, not when getting/importing)

Thoughts?

enhancement question

All 4 comments

Add --remote option to get/import: not great because it requires manually inspecting the source project config file before being able to get/import.

Will work, and is already in TODO :)

get/import can try every remote in the source project config file sequentially and use the first one that contains the target data to download

Kinda like how yum/apt/etc. try out different mirrors until something works. I like that idea! I would not jump into implementing it right away, but would give it some thought, since there might be some arch decisions that we need to figure out.

New behavior in dvc remote default to be able to set default remotes for specific paths in a project (i.e. this would be done in the source data registry, not when getting/importing)

It has been discussed previously somewhere, but in reality, it is not trivial to organise in a nice way. We either have to create some file that tells which paths are pushed where, or we have to abuse dvc-files and specify remotes inside them, which means that they will be overwritten on some operations, which is not great. There is a subrepo (--subdir) PR though, which solves that problem for monorepo cases. That is, I would say, one of the most popular scenarios in which users might want to have different remotes for different paths. So I would wait for --subdir and then see if there are new requests coming in.

Add --remote option

already in TODO :)

Nice, I didn't know. Is there an issue or PR for it? This could be the short-term solution.

get/import can try every remote

I like that idea! ... there might be some arch decisions

arch decisions?

There is a subrepo (--subdir) PR though

Yeah, this is an alternative solution but adds complexity (managing parent vs children DVC projects).

Thanks Ruslan!

Nice, I didn't know. Is there an issue or PR for it? This could be the short-term solution.

@jorgeorpinel, check #2466

OK so this issue is left here just to consider the enhancement of:

get/import can try every remote in the source project config file sequentially and use the first one that contains the target data to download

I think.
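
For reference, the remaining proposal (try each remote in order, use the first that has the data) could be sketched roughly like this. It is only an illustration under stated assumptions: remotes are modeled as plain in-memory dicts keyed by checksum, and fetch_from_first_remote is a hypothetical helper, not an existing DVC function.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical stand-in for a remote: (name, {checksum: file contents}).
Remote = Tuple[str, Dict[str, bytes]]


def fetch_from_first_remote(
    checksum: str, remotes: Iterable[Remote]
) -> Optional[Tuple[str, bytes]]:
    """Try remotes in config order; return (remote_name, data) for the
    first one that contains the checksum, or None if none of them do."""
    for name, storage in remotes:
        if checksum in storage:
            return name, storage[checksum]
    return None


# Example: the data lives only in the second remote.
example_remotes = [
    ("images", {"abc123": b"image bytes"}),
    ("audio", {"def456": b"audio bytes"}),
]
result = fetch_from_first_remote("def456", example_remotes)
print(result)
```

A real implementation would need to decide how to handle unreachable remotes and how much an existence check costs per remote, which is presumably part of the arch decisions mentioned above.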
