`dvc get` uses the default remote defined in `.dvc/config`. If there isn't one defined, it will fail with:

```
file error: No remote repository specified. Setup default repository with
    dvc config core.remote <name>
or use:
    dvc pull -r <name>
```

But the `get` command doesn't allow selecting a remote. Moreover, it might be the case that we want to choose a remote other than the default one.
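For context, today the source repository needs a default remote configured for `dvc get` to resolve data. A minimal sketch of that setup (the remote name and bucket path are placeholders, not from this thread):

```bash
# inside the repository you want to `dvc get` from
dvc remote add -d myremote s3://example-bucket/dvc-storage  # -d marks it as the default remote
dvc push                                                    # upload the cached data to that remote
```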
Hi @a0th !
`get`/`import` assume that you've set some default remote already in the repo you are getting/importing from, since it is a part of its state. Looks like you don't have that one set up. Was it an accident, or do you consciously not want to set up a default remote in the repository that you are downloading from? Could you elaborate on your scenario, please? 🙂
Ah, didn't notice the discord discussion. For the record: https://discordapp.com/channels/485586884165107732/485596304961962003/619026170369015818
TL;DR from the discord discussion: implement the `--remote` option for `dvc get` to select a remote other than the _default_. This is for the case where some files are `push`ed to remote _alpha_ and other files are `push`ed to remote _bravo_. I'm not sure if we should allow this, but it seems like a good idea for flexibility.
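For illustration, assuming the flag were added, the proposed usage might look like this (the repository URL, remote name, and path are made up for the example):

```bash
# hypothetical: fetch a tracked path from a non-default remote of the source repo
dvc get --remote bravo https://github.com/example/project data/model.pkl
```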
Btw, there is a clear bug in this message: `or use: dvc pull -r <name>` - it has nothing to do with `dvc get`.
There is definitely a lot of value in being able to define a different remote per output, or even specify that we don't want to push certain outputs to remote storage. The biggest use case here is related to security - I want to expose publicly only models, but not raw data -> it's better to split them into different remotes. But I believe there are other cases as well - different storages (due to price), etc.
@a0th could you please give a bit more context on your specific case? Why do you need multiple remotes in your case?
`--remote` is one of the solutions we can use to provide the interface for this, but long term we should think about something else (e.g. keep the info in the DVC-file per output). `--remote` is error-prone, cumbersome, etc., especially when you use it with `dvc push`/`dvc pull`.
I am against adding `--remote` for this. The use case of several remotes for different outputs should be handled in a more declarative way on the source repo side. It's either a stated remote in the dvc-file for a particular out, or some declarative cache router config which says what is pushed where. This will also simplify things and make them less error-prone for people using the source repo - `dvc push/pull` will simply do the right thing, possibly working with several remotes in parallel.
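To make the declarative idea concrete, a hypothetical dvc-file sketch (the per-output `remote:` field is not an existing feature here, just an illustration of the idea; paths and hashes are made up):

```yaml
# hypothetical dvc-file sketch: each output declares which remote it belongs to
outs:
- path: models/model.pkl
  md5: 0123456789abcdef0123456789abcdef
  remote: public-models     # push/pull this output to/from the "public-models" remote
- path: data/raw.csv
  md5: fedcba9876543210fedcba9876543210
  remote: internal-data     # keep raw data on a restricted remote
```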
UPDATE: This issue applies to `dvc.api.read()/open()` as well as `dvc get/import`.
The last part of the comment by @alvaroabascar in https://github.com/iterative/dvc/issues/2711#issuecomment-549741158 is related.
The same question to @alvaroabascar as I asked @a0th: it would be great to have a case/setup in mind where it's indeed necessary to provide a remote name like this. Mostly the question is why we would keep the remote undefined in the external repo in the first place.
I can totally see a use case for this in the form of a faster remote mirror :) For example, you want to `dvc get` from a public dvc repo, but your company has a local remote with that data already, so `dvc get/import` from it would be much faster than from the public repo.
@efiop what you are talking about is a different case again. How do you configure your own remote on someone else's repo so that you can refer to it?
@Suor Off the top of my head:

```bash
git clone https://github.com/user/project
cd project
dvc pull
dvc remote add mirror s3://bucket/path --system # configuring for everyone on the machine
dvc push -r mirror
```

and then

```bash
dvc get -r mirror https://github.com/user/project data
```
Thank you to all the contributors for your work on this very helpful project!

We are running into this issue when trying to use `dvc import` to pull data from a data registry repo.

Our use case is that we have some data that for regulatory reasons cannot leave certain servers. We want to track the latest md5 hashes for all the data in a central repo. When working with a project on server A, we want to be able to select server A's dvc cache as a remote to import from. When working with a project on server B, we want to be able to select server B's dvc cache as a remote to import from.

Currently, we are unable to accomplish this use case via `dvc import`. We are currently working around this by manually copying the `.dvc` and `.gitignore` files from the data registry on server A into the project folder on server A and then running `dvc checkout`, but this is inconvenient.

Thanks again!
@jburnsAscendHit I think there is a workaround for your case. But let me first clarify, do you use any remotes at all on servers A/B?
Hi @shcheklein , thank you for your question.
I will try to answer as clearly as I can; I am not sure that I understand dvc enough to answer conclusively.
We are manually creating remotes in an attempt to get `dvc import` and `dvc update` to work between multiple repos all hosted on server A. We are not using remotes to push or pull information from server A to Amazon S3 or other remote file shares.
Here are more details on what I've tried so far:
On server A, with the data registry repo as my current working directory, I tried to configure a local-only default remote like this:
```bash
dvc remote add --default --local server-A-local-cache /path/to/serverA/local/dvc/cache
```
Within a dvc project repo on server A, I tried importing from the data registry repo like this:
```bash
dvc import /path/to/serverA/data/registry/repo dataset1
```
I then received this error message:
```
ERROR: failed to import 'dataset1' from '/path/to/serverA/data/registry/repo'. - No DVC remote is specified in the target repository '/path/to/serverA/data/registry/repo': config file error: no remote specified. Setup default remote with
    dvc config core.remote <name>
or use:
    dvc pull -r <name>
```
We would like to clone the data registry repo out to multiple servers. Therefore, setting `core.remote` in the data registry repo would point the default remote to a path that is only present on one server. Thus, we need a way to set the remote used for `dvc import` commands differently for different servers. I couldn't see a way to change or configure which remote was used, and this issue seemed to match the problem that was blocking me.
Does this clarify the use case and answer your question?
@jburnsAscendHit This looks weird. Can you show `/path/to/serverA/data/registry/repo/.dvc/config`?
@Suor I think this is because `--local` is used. I think the remote is not defined in `/path/to/serverA/data/registry/repo/.dvc/config`.
@jburnsAscendHit yes, your case makes total sense to me. Can you try, as a workaround, to define a remote per machine like this with `--global`?
Also, it looks like your case is actually relevant to this optimization: https://github.com/iterative/dvc/issues/2599 . When it's done, you won't need to deal with remotes at all in your specific case.
Ah, I see, it shouldn't require global; the simple default config level should do the job. It won't work outside server A, of course. Why are you using the local config option?
@shcheklein making a scenario depend on an optimization is fragile, e.g. it will create edge cases like a difference between the workspace and master. The error is correct because import refers to the master branch, which doesn't have the local config.
@Suor I think local is being used due to:

> We would like to clone the data registry repo out to multiple servers. Therefore, setting core.remote in the data registry repo would point the default remote to a path that is only present on one server.
> making a scenario depend on an optimization is fragile

I agree. Just mentioning it here since it might resolve this specific case.
We might want to consider a local repo with its cache to be a legit source for imports, then that won't be a hack :) And no remote will be required.
> We might want to consider a local repo with its cache to be a legit source for imports

It's already a legit source for imports, right? #2599 is about fetching from the cache if data is cached already. Or I don't understand what you're trying to say :)
Hi @Suor and @shcheklein,
Thank you for the comments!
Setting a global default remote worked; thank you for the workaround!
I think @shcheklein properly identified the use case my team and I are working on. However, I will try to explain it in greater detail to clarify my understanding.
`/path/to/serverA/data/registry/repo/.dvc/config` currently looks like this (we are using the large files optimization):

```
[cache]
protected = true
type = "reflink,hardlink,symlink,copy"
```
As @Suor noted, this looks weird because there is no `core.remote` setting within the repo.

However, my team and I would like a different `core.remote` setting on each server for our unique regulatory use case. (We want to be able to monitor md5 hashes for all data from a central repo, but we don't want the actual data to leave any of the individual servers. For convenience, we would like to keep `core.remote` unset within our primary data registry repo's git history.)
@shcheklein recommended using `--global` to define a default remote at a higher level. To follow this recommendation, I ran
```bash
dvc remote add --default --global server-A-local-cache /path/to/serverA/local/dvc/cache
```
Then, within various dvc projects on server A, I am now able to run

```bash
dvc import /path/to/serverA/data/registry/repo dataset1
```

to import dataset1 into my projects.
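For context, a sketch of what the resulting global config might then contain (typically `~/.config/dvc/config` on Linux; the exact location and layout here are assumptions on my part, not verified in this thread):

```
[core]
remote = server-A-local-cache
['remote "server-A-local-cache"']
url = /path/to/serverA/local/dvc/cache
```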
As @shcheklein notes, once #2599 is merged, the `dvc remote add` command above will no longer be needed.
Of note, if I would like to override the global default remote for a given project, it seems from the source that the configs are merged such that settings in the local or repo/default config override the global config. Therefore, the local or repo/default config can still be used to change the default remote in cases where a different default remote is needed.
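For example, a minimal sketch of such a repo-level override (the remote name and path are placeholders):

```bash
# inside a project that should not use the global default remote
dvc remote add --local project-specific /path/to/other/storage  # define a repo-local remote
dvc config --local core.remote project-specific                 # make it the default for this repo only
```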
Thank you all for your assistance on this use case and your work on this project!
Any timeline for this one? We currently mirror certain datasets to multiple environments for infosec reasons (no cross-account data movement), and do separate training runs per environment. We need to be able to pull these datasets from the correct s3 bucket for training depending on the environment where we're running training. We're currently just using timestamped dataset locations stored in s3, but would love to switch to a dvc dataset repo and have our training code use `dvc get --remote {env}` to fetch datasets instead.

We could use `dvc import` instead, to make it easier for the training code to keep track of which dataset version it's importing. I believe that in order to do that we'd want to support `dvc import --remote`, and to have the remote tracked in the stage file.
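For illustration, a hypothetical sketch of what an import stage file could look like if the chosen remote were recorded in it (the `remote:` field under `repo:` is not an existing DVC field, just an illustration of the idea; the URL, remote name, and hashes are made up):

```yaml
# dataset1.dvc -- hypothetical import stage recording which source remote to fetch from
deps:
- path: dataset1
  repo:
    url: https://github.com/example/dataset-registry
    rev_lock: 0123456789abcdef0123456789abcdef01234567
    remote: env-us-east-1   # hypothetical: remote of the source repo used for this import
outs:
- path: dataset1
  md5: 0123456789abcdef0123456789abcdef
```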
Hi @pokey !

Thank you for your patience. We haven't started working on it yet, but it is definitely on our radar and should be coming in the next few weeks.
We also have a use case: We are currently running multiple (redundant, mirrored) remotes. Our primary remote is on a NAS located within our offices. Access is very fast while on the corporate network and very slow over the VPN.
To support development during quarantine, we've established a second remote on S3 ... this has medium speed, but is globally available. I don't intend for this remote to exist forever, but it will be essential for the next few months.
This leads to two classes of users:
Yes, there's a small cost for keeping everything in sync, but it's acceptable for now.
This all works nicely, if somewhat manually, in the original data repo using `dvc push/pull --remote`. However, it breaks when `import`ing into an analysis repo. Adding a `--remote` option to `import` (specifying the remote in the imported-from repo, which, yes, you'd need to know a priori) seems like a reasonable solution.
@amarburg thanks for the request! The only thing I want to clarify is your workflow with the NAS. You use it as a remote (`dvc remote add ...`), but have you seen the Shared Dev Server scenario? I wonder if a shared cache setup fits your case better when you use the NAS. You wouldn't have to use dvc push/pull at all in that case, or you could use them only to push/pull to/from S3. You won't have to use `--remote` and won't need it for import either.
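For reference, a minimal sketch of what such a shared cache setup on an NFS-mounted NAS could look like (the mount path and group-sharing choice are assumptions for the example, not from this thread):

```bash
# point the project's cache at a directory on the NAS that everyone mounts
dvc cache dir /mnt/nas/dvc-cache
# allow all users in the group to reuse the same cache files
dvc config cache.shared group
# prefer links over copies so checkouts stay fast and space-efficient
dvc config cache.type "reflink,symlink,hardlink,copy"
```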
I know that might be a pretty big change to the workflow, so I'm mostly asking out of curiosity and to get your feedback on this.
@shcheklein thanks, I had not seen that before, and it would be of benefit when we're all working on the machines which NFS-mount the NAS. In general, though, we're distributed across enough workstations (even when we're all in the office) that I don't think we'll ever get away from having a remote.