Currently we only support directories for local remotes. Need to add support for:
Most likely, this could be achieved in a single pass by unifying things like RemoteLOCAL.load_dir_cache into RemoteBase.
My team and I already need support for defining directories as outputs in our setup, where we use an SSH remote. It would simplify several of our DVC stages.
I hope this feature comes soon to SSH 😁
@efiop, is this feature suitable for a pull request? It is the one feature I need the most. Any recommendations on where to begin with the implementation for the SSH remote, such as code files, classes, functions, etc.?
@PeterFogh I have a patch for supporting it for all remotes lying around in a half-completed state. Let me see if I can speed it up.
@efiop, that sounds great. I'm just in the mood for contributing to an open source project this weekend 😁 Maybe I can find another (and maybe simpler) issue.
I'm looking forward to seeing your solution for this one.
@PeterFogh Sorry for stealing it 😉 Please take a look at the "help wanted" and "good first issue" labels on our issues list if you still feel like contributing 🙂 Let me know if you need any suggestions.
@efiop, no pressure, but do you think this feature will be released within the next week? I can use your estimate for the planning for my team :)
@PeterFogh, it is on the "Weekly tasks" list, so I'm sure @efiop is considering working on this. I can't guarantee it will be released next week, but at least it will be in WIP status :)
@PeterFogh I am working on this. The patch as a whole is pretty big, I'm in the middle of fixing tests for it (have some other tasks too). Need to fix the rest of the tests, rebase on top of master and submit it. I expect it to be ready during the next week max. Sorry for the delay.
Hi, @mroutis, @efiop. Thanks for your replies. It sounds awesome and I'm glad to hear that it may be released next week because it fits well with our planning at my end.
@PeterFogh Really pushing for it to be ready by Thursday. Thank you for your patience.
@efiop sounds awesome. I'm really looking forward to the next release, especially if it includes both #1654 and #1614. Patience is always rewarding! 😄
@PeterFogh Actually #1614 has just been released in 0.35.5, please feel free to upgrade. #1654 is coming soon :slightly_smiling_face:
@PeterFogh FYI: Have a working implementation for ssh at https://github.com/efiop/dvc/tree/1654 . Now polishing and adding proper tests. ETA is this week. Thanks for your patience.
Closed accidentally. Generalization and support for directories on ssh is done.
Hi @efiop, I have refactored my DVC pipeline and code in https://github.com/PeterFogh/dvc_dask_use_case to utilize your new and awesome support for directories 👍 It works like a charm 😄
FYI, I have used the feature to create a folder for the output files of each DVC stage. This simplifies the file layout on the remote data storage as well as in the DVC stage files.
When do you think the commit with this feature will be released in a new version?
No pressure, but I'm already missing it in my work this week 😄
@PeterFogh Glad it works for you! :slightly_smiling_face: So sorry for the delay, we have a few major things coming and just trying to finalize it so it is ready for release. We plan to release 0.40 this week.
@efiop, it's fine, just happy to see the progress 😃 Yeah, I assume it is your PyCon attendance that is pressing? A new release this week would be awesome 👍 Thanks for the update.
@PeterFogh 0.40.1 is finally out 🙂 Please upgrade.
I'd also appreciate S3 directory caching support. My use case is an EMR cluster using Spark to write many GBs of partitioned CSVs to S3 so the data can be copied more easily to AWS Redshift.
As pointed out by @Suor, we need to check whether you could have s3://bucket/path and s3://bucket/path/ at the same time.
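To illustrate the concern: S3 has a flat key namespace, so an object stored under the key `path` and objects under the `path/` prefix can exist in the same bucket. Below is a minimal sketch that uses a plain Python set as a stand-in for a bucket listing. The `classify` helper and the key names are hypothetical and are not DVC or boto3 code.

```python
def classify(keys, path):
    """Return (is_file, is_dir) for `path` in a flat listing of object keys.

    `keys` is a set of key strings standing in for a bucket listing
    (an assumption made for illustration only).
    """
    base = path.rstrip("/")
    prefix = base + "/"
    # An object whose key is exactly `base` behaves like a file.
    is_file = base in keys
    # Any key starting with `base/` makes `base` behave like a directory.
    is_dir = any(key.startswith(prefix) for key in keys)
    return is_file, is_dir


# Both a "file" and a "directory" named `path` in the same listing.
keys = {"path", "path/part-0000.csv", "path/part-0001.csv"}
print(classify(keys, "path"))  # (True, True)
```

When both flags are true, a path cannot be treated as simply a file or simply a directory, which is the case that would need explicit handling.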
What about Azure Blob Storage? E.g. files in ADLSGen2.
Found this issue due to recent Discord conversation.
Oh. And should some docs be updated already to let people know SSH directories are supported as external dependencies and outputs?
P.S. I noticed Azure is not mentioned in our External Dependencies guide, but it's a supported remote. Is the doc incomplete?
Thanks!
@jorgeorpinel
> What about Azure Blob Storage? E.g. files in ADLSGen2.

We only support it through the blob API; there is no native support for it at all right now. Plus, as described in that conversation, ETags on Azure don't seem to be consistent enough to be used as a hash for external outputs.

> Oh. And should some docs be updated already to let people know SSH directories are supported as external dependencies and outputs?

Good point. I guess we can include it in the https://github.com/iterative/dvc.org/issues/411 ticket?

> P.S. I noticed Azure is not mentioned in our External Dependencies guide, but it's a supported remote. Is the doc incomplete?

Azure is not supported for external dependencies and outputs. So the doc is fine.
OK, thanks. Added note to update docs to https://github.com/iterative/dvc.org/issues/411#issuecomment-540129838.
Hi, does dvc currently support external storage for s3? It returns "failed to add file - output 's3://dvcbucket/mydata' does not exist" after I run the command "dvc add s3://dvcbucket/mydata". However, I created the bucket "dvcbucket" and the folder "mydata" beforehand.
Thanks.
Hello, @loche415, I'm working on supporting directories for S3 outputs and dependencies. There's a _work in progress_ already; it should be ready by tomorrow :)
Thanks for your hard work @mroutis and looking forward to this new feature.
@matt-miller-virginia @loche415 Hi guys! Support for s3 directories is already released in 0.66.0, please upgrade and give it a try :) Big thanks to @mroutis for implementing it! 🙏 And thank you guys for the feedback! 🙂
For the record, s3 was added to the doc ticket too https://github.com/iterative/dvc.org/issues/411
Closing for now, as no one is asking for gs support and hdfs is not possible to support right now (see the comment in the issue description).