Dvc: Support IPFS, DAT, or other distributed storage

Created on 22 Jul 2018 · 7 Comments · Source: iterative/dvc

Having an option to share data files in a peer-to-peer way is probably a good idea. It eliminates the need to pay for external services, and scales much better in the "public open project" situation (where a lot of cloners would mean substantial S3 costs).

IPFS is probably the easiest to support here, together with DAT. Using BitTorrent directly seems complicated.

feature request help wanted p3-nice-to-have

remram44

👍7

All 7 comments

Hi @remram44 !

Great idea, thank you! If you wish to implement support for some of those, please feel free to take a look at the dvc/remote/ directory in our project, where all current remote drivers are implemented. To support data pulling/pushing, you would only need to implement the download(), upload() and exists() methods, so it should be pretty easy. We will be happy to merge any proper pull request :slightly_smiling_face:
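To make the shape of such a driver concrete, here is a minimal sketch of the three methods mentioned above. Everything in it is an assumption made for illustration: the class name `ContentAddressedRemote`, the method signatures, and the SHA-256 keying are hypothetical and are not DVC's actual base class or API. A local directory stands in for the peer-to-peer network; a real IPFS or DAT driver would replace the file copies with calls to that network's client.

```python
import hashlib
import shutil
from pathlib import Path


class ContentAddressedRemote:
    """Hypothetical remote driver sketch; NOT DVC's real remote interface.

    A local directory stands in for a content-addressed P2P store.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def checksum(path):
        # Hash the file in chunks so large datasets do not need to fit in memory.
        # SHA-256 is an arbitrary choice for this sketch.
        digest = hashlib.sha256()
        with open(path, "rb") as fobj:
            for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def _location(self, key):
        # Shard by the first two hex characters to keep directories small.
        return self.root / key[:2] / key[2:]

    def exists(self, key):
        """Return True if an object with this key is already stored."""
        return self._location(key).is_file()

    def upload(self, path):
        """Store the file at `path` and return its content-derived key."""
        key = self.checksum(path)
        if not self.exists(key):
            target = self._location(key)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(path, target)
        return key

    def download(self, key, path):
        """Copy the object identified by `key` to the local `path`."""
        if not self.exists(key):
            raise FileNotFoundError(key)
        shutil.copyfile(self._location(key), path)
```

Because keys are derived from file content, uploading the same file twice is a no-op, which is the property that makes content-addressed networks a natural fit for a data cache.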

A current workaround for that would be to just pack your .dvc/cache directory and share it with others using any P2P protocol you'd like, including BitTorrent :slightly_smiling_face:
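As a rough sketch of that workaround, the cache directory can be packed into a single archive with Python's standard `tarfile` module and unpacked on the receiving side. The helper names `pack_cache` and `unpack_cache` are made up for this example, and the paths in the usage note are only illustrative.

```python
import tarfile
from pathlib import Path


def pack_cache(cache_dir, archive_path):
    """Pack a cache directory into a gzip-compressed tarball."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # Store entries relative to the cache root so the archive
        # can be unpacked into any target directory.
        archive.add(Path(cache_dir), arcname=".")
    return Path(archive_path)


def unpack_cache(archive_path, cache_dir):
    """Unpack a tarball created by pack_cache into cache_dir."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Only unpack archives from peers you trust: tar members
        # can carry unexpected paths.
        archive.extractall(cache_dir)
```

For example, `pack_cache(".dvc/cache", "cache.tar.gz")` produces one file that can be seeded over any P2P protocol, and a collaborator would run `unpack_cache("cache.tar.gz", ".dvc/cache")` inside their clone.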

Thanks,
Ruslan

efiop on 22 Jul 2018

Indeed this looks easy to add. I don't have the cycles to attempt this now, but I might try in the future.

remram44 on 22 Jul 2018

❤1

The idea of storing data in p2p/blockchain looks very appealing. We develop DVC based mostly on our industrial data science experience, where p2p is not a big part of the environment. But it might become one soon! Recently, I got another request (not on GitHub) regarding the denet.pro dApp for storing data for DVC.

It would be great to understand the p2p dataset landscape:

- who uses these tools for storing datasets for ML?
- what types of ML projects: deep NN for images, NLP, academic analytical projects ...?
- what are the use cases: storing common data sources, project results, all data derivatives?
- what are the most common protocols in this space and the differences between them?

If there is a demand we can definitely implement this.

@remram44 please let me know if you use this kind of storage. I would really like to discuss the use cases and your thoughts. Or if you can connect us to other users or the tool/protocol creators.

dmpetrov on 22 Jul 2018

👍1

I'm thinking about the case where you make an analysis public, e.g. publish it on GitHub. Having everyone download from your S3 bucket would incur charges, and hosting it on some box in your lab would provide very limited bandwidth. Peer-to-peer solutions would scale nicely.

remram44 on 23 Jul 2018

👍2

Are there any plans to implement this? @remram44 @dmpetrov
Would a contribution still be valued?

Also I can see further applications in the field of scientific reproducibility and general public data sharing.

flxai on 18 Dec 2019

👍2

@icks No such plans from the core team, at least for now. We would appreciate it if you could share your thoughts on this and the scenarios in which you would like to use it. Contributions are always welcome, feel free to give it a shot. Ping us here or on Discord if you need any help 🙂

efiop on 18 Dec 2019

@icks it would be a good addition indeed. Unfortunately, it would take a while for the core team to prioritize this, as @efiop mentioned :( We would really love for the community to make a contribution in this case, and we can provide all the support and help on this.