DVC has dependencies on multiple cloud services (AWS, Google), though a typical installation will only use one of them.
Is it possible to make the package not use or require unneeded packages?
At runtime, it's usually possible to do 'lazy' imports that are performed only when the specific cloud service is configured.
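To illustrate the lazy-import idea, here is a minimal sketch, not DVC's actual code. The `REMOTE_MODULES` mapping and the `load_remote_module` helper are hypothetical names, and the module names in the mapping are assumptions about which client library each service would need.

```python
import importlib

# Hypothetical mapping from a configured remote type to the module it needs.
# These module names are assumptions for illustration only.
REMOTE_MODULES = {
    "s3": "boto3",
    "gs": "google.cloud.storage",
}


def load_remote_module(remote_type):
    """Import the client library for remote_type only when it is requested."""
    try:
        module_name = REMOTE_MODULES[remote_type]
    except KeyError:
        raise ValueError("unknown remote type: %r" % (remote_type,))
    try:
        # The import happens here, at call time, not when this file is loaded.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            "remote %r needs the %r module, which is not installed"
            % (remote_type, module_name)
        ) from exc
```

With this shape, a user who only configures one service never triggers the import of the others, and a missing library produces a readable error instead of a crash at startup.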
Hi @ophiry
It is a good idea. But it might confuse some users: after changing the cloud in the config, DVC can start crashing. From pip's point of view, we might (in theory) separate the dvc package into a few packages: dvc, dvc-aws, dvc-gcp. This approach has advantages as well as disadvantages. We should think about this.
PS:
In the new version, we are going to provide binary packages for all OSs. All the dependencies will be inside the binary packages - no issues with dependencies.
IIRC there's no need to separate packages on pip; you can use conditional dependencies and let people install dvc, dvc[aws] or dvc[all], for example.
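For reference, conditional dependencies are declared through the `extras_require` argument of setuptools' `setup()`. The following is a rough sketch of how the extras could be laid out, not the project's actual setup.py. The extra name `gcp` and the package names under each extra are assumptions for illustration, and the core list is left empty on purpose.

```python
# Sketch of the dependency declarations for a setup.py that uses extras.
# Package lists are illustrative assumptions, not DVC's real requirements.
install_requires = []  # core dependencies would be listed here

extras_require = {
    "aws": ["boto3"],
    "gcp": ["google-cloud-storage"],
}

# "all" is the union of every optional group, without duplicates.
extras_require["all"] = sorted(
    {pkg for pkgs in extras_require.values() for pkg in pkgs}
)

# In a real setup.py these would be passed to setuptools:
#   setup(name="dvc", install_requires=install_requires,
#         extras_require=extras_require, ...)
```

Users would then pick what they need with `pip install dvc`, `pip install "dvc[aws]"` or `pip install "dvc[all]"`.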
@villasv thank you! I will take a look at it.
I just had to install dvc inside a pipenv and boy, the GCP packages are a ton of dependencies. It feels like I now have a node_modules inside my project. Seriously, GCP dependencies are about 30% of the Pipfile.lock, while dvc totals 90%.
I believe this issue can be extended to other optional dependencies, like HDFS and so on. It doesn't seem too difficult to implement optional dependencies. I might be able to send a PR this week.
Help wanted?
PRs are always appreciated :) Thank you for looking into it!
Alright. Time to get it done. What are all the cache integrations that we could apply this to? I know we have s3, gs and hdfs, though I'm a bit unsure which dependencies each of those requires that are not required by dvc "core".
@villasv All remote drivers are located in dvc/remote. There are four right now: s3 (boto3), gs (google-cloud* stuff), hdfs (no pip dependencies required, since it uses the hadoop CLI utility) and ssh (paramiko). Everything that uses these dependencies is isolated in the respective dvc/remote/*.py files, so it should be pretty easy. Thank you for looking into it! Please feel free to ping me if you need anything.
I noticed that ply is installed with >3.11 by default, but there's a comment mentioning that 3.8 is required by google-cloud. Should I leave 3.8 as a core package?
Yeah, I would use the same versions that are currently listed in requirements.txt (so ply==3.8) unless there is a problem that requires adjustments.
Got it. How should I run the tests?
Travis and appveyor will handle it from here, thank you :) I'll take a look at it shortly.