We are paying quite a bit of money right now for this.
Maybe you can use torrents?
Good idea @GreenGearX! I have heard you can also create torrents that point at a web URL. We should probably create one of those regardless.
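For anyone curious what a torrent that points at a web URL looks like: the metainfo file can carry a `url-list` key (the "web seed" extension, BEP 19) so clients fall back to plain HTTP when no peers are seeding. Below is a rough sketch, not the project's tooling. The file name, URL and piece size are made up for illustration, and a real torrent would usually also list a tracker under `announce`.

```python
import hashlib
import io


def bencode(value):
    """Encode ints, bytes, str, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must be byte strings, written in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(fileobj, name, web_seed_url, piece_length=256 * 1024):
    """Hash a single file piece by piece and return a metainfo dict with a web seed."""
    piece_hashes = []
    total = 0
    while True:
        piece = fileobj.read(piece_length)
        if not piece:
            break
        total += len(piece)
        piece_hashes.append(hashlib.sha1(piece).digest())
    return {
        "info": {
            "name": name,
            "length": total,
            "piece length": piece_length,
            "pieces": b"".join(piece_hashes),
        },
        # BEP 19 web seed: the HTTP location of the same file (hypothetical URL).
        "url-list": [web_seed_url],
    }


# Tiny in-memory demo; a real run would pass open("en.tar.gz", "rb") instead.
demo = build_metainfo(io.BytesIO(b"a" * 10), "en.tar.gz",
                      "https://example.org/en.tar.gz", piece_length=4)
torrent_bytes = bencode(demo)
```

Writing `torrent_bytes` to a `.torrent` file is all that is left; the existing HTTP host then acts as a permanent seed while peers take over some of the bandwidth.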
We also want to move to the DAT protocol, but we ran into trouble because of our large number of files (1 million+). See: https://github.com/datproject/dat/issues/915
In the short term, we might pay for a cheaper solution than S3. Thanks again for the suggestion @GreenGearX! That's why I created this issue.
@mikehenrty Is it possible to split it up into chunks? Right now it's an all-or-nothing download, which has a few drawbacks. For example, as the file size increases, an interrupted connection becomes more likely, and the download then has to be restarted, wasting bandwidth. Splitting it up into chunks also makes it easier to validate parts of the recordings without having to commit to a full download.
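Even with a single tarball, the interruption problem can be softened on the client side, assuming the server honors HTTP `Range` requests (S3 does). Here is a minimal sketch of resuming a partial download with the standard library. The function names and the example URL are made up, and error handling is left out.

```python
import os
import urllib.request


def build_resume_request(url, partial_path):
    """Return (request, offset) asking only for the bytes not yet on disk."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if offset:
        # Open-ended range: everything from `offset` to the end of the file.
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def resume_download(url, partial_path, block_size=1 << 20):
    """Continue downloading `url` into `partial_path`, appending when possible."""
    request, _offset = build_resume_request(url, partial_path)
    # Note: if the file is already complete, the server answers 416 and
    # urlopen raises urllib.error.HTTPError; a real tool would catch that.
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honored; 200 means the whole file is being resent.
        mode = "ab" if response.status == 206 else "wb"
        with open(partial_path, mode) as out:
            while True:
                block = response.read(block_size)
                if not block:
                    break
                out.write(block)
```

Calling `resume_download("https://example.org/en.tar.gz", "en.tar.gz.part")` again after a dropped connection would pick up where the previous attempt stopped instead of starting over.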
DAT (mentioned above) gives the ability to pause and resume syncing, which is useful if the connection is spotty.
As for breaking the download into chunks, the only useful split I can think of is validated vs. unvalidated. This would cut the repo roughly in half. Anything beyond that might actually make it more annoying for people to download.
I agree, splitting up too much could become a nuisance. Splitting between validated and unvalidated coupled with DAT seems like a great option.
I also think torrents would be a good idea. You could release updated torrents on an RSS feed and people who wanted to help host would simply subscribe and make it available to seed. I for one would be happy to!
+1 for torrents. But I would keep S3 as a backup option.
Hmmm, I just came across this which looks pretty promising:
https://docs.aws.amazon.com/AmazonS3/latest/dev/S3TorrentRetrieve.html
This looks good. And it would work for the other datasets hosted on S3, too.
Hello! I've put the tarball up on Dat for now. Users can download via Dat to help offset bandwidth costs a bit. Once we fix that bug, it'll be great to be able to download subsets of files over Dat.
That S3 -> Torrent thing is cool! Creating a better S3 -> Dat workflow will definitely help make usage easier.
Not entirely sure, but maybe archive.org? They create a torrent too while supplying free downloads.
Any updates on making a torrent link available on the webpage?
@missuniverse I did some digging, and it turns out archive.org items cannot have more than 100,000 files. It would have been great if we could use it, since it provides a command-line tool that we could use to automate the flow.
Edit: there are 677,021 mp3 files in the en collection
@f0cus10 Then it's just a matter of how we package it. If the files are in an archive (tar, zip, rar, etc.), then we are good. Also, to my knowledge, archive.org partners with organizations to archive their stuff, which I think might be cheaper than S3. They helped NASA JPL archive their public images.
Yeah, I was thinking that might be a way to bypass those limitations. I'll keep on digging in that area.
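To make the packaging idea concrete, here is a small sketch of grouping clips into tar archives so that an item holds a handful of archives rather than hundreds of thousands of loose files. The shard size and file names are made up. With the 677,021 English mp3 files mentioned above and a hypothetical 10,000 clips per archive, this would give 68 archives.

```python
import tarfile


def plan_shards(paths, files_per_shard):
    """Split a list of file paths into consecutive groups of at most files_per_shard."""
    if files_per_shard < 1:
        raise ValueError("files_per_shard must be at least 1")
    ordered = sorted(paths)  # stable order so shards are reproducible
    return [ordered[i:i + files_per_shard]
            for i in range(0, len(ordered), files_per_shard)]


def write_shard(paths, tar_path):
    """Write one group of files into an uncompressed tar archive."""
    # mp3 is already compressed, so plain tar avoids wasted CPU time.
    with tarfile.open(tar_path, "w") as archive:
        for path in paths:
            archive.add(path)


# Example plan with made-up clip names (nothing is written to disk here).
shards = plan_shards([f"clip_{i:02d}.mp3" for i in range(25)], 10)
```

Each entry in `shards` could then be passed to `write_shard` with a name such as `en-0001.tar` before uploading.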
Hello!
I'm working for the Dat Foundation and we just wanted to reach out to see if you were still interested in exploring different ways to transfer your data. We're just finishing up a release that'll enable Dat to handle larger datasets and have a FUSE mount so you'll be able to store files in a folder on your actual filesystem and have them automatically sync via Dat, or to be able to sparsely download mounted datasets as you're reading from files in this filesystem.
Does that sound like something you folks would be interested in discussing further?
Kind Regards,
Mauve
Hi Mauve,
This is something that we are looking at and is on the roadmap but we are not ready to implement currently. We will reach out in the future.
Hmmm, I just came across this which looks pretty promising:
https://docs.aws.amazon.com/AmazonS3/latest/dev/S3TorrentRetrieve.html
Hey, I tried this method to download the English dataset, which is now ~30 GB, and got the following error, so this method may no longer work for us:
GET https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-3/en.tar.gz?torrent
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidRequest</Code><Message>Torrent creation is not supported for objects larger than 5368709120</Message><RequestId>0445D1E43E735274</RequestId><HostId>Ndv7t13hIWz/3wPHWLegJnukRaEJq01YH5DLhxpJ8JEXxOSPpiQqRdCX+bNkY10YTMbCm1N9UIY=</HostId></Error>
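The limit quoted in the error, 5368709120 bytes, is exactly 5 GiB (5 × 1024³), so a ~30 GB tarball is far above it. Below is a small sketch of checking an object's size before asking S3 for a torrent. The helper names are made up, the URL is assumed to carry no query string, and the `HEAD` request assumes the object is publicly readable.

```python
import urllib.request

# Taken from the error message above; equals 5 * 1024**3 bytes (5 GiB).
S3_TORRENT_MAX_BYTES = 5368709120


def s3_torrent_url(object_url):
    """Build the torrent-retrieval URL by appending the ?torrent subresource."""
    return object_url + "?torrent"


def can_request_torrent(size_bytes):
    """True if the object is small enough for S3 to generate a torrent."""
    return size_bytes <= S3_TORRENT_MAX_BYTES


def object_size(object_url):
    """Read Content-Length via a HEAD request (needs network access)."""
    request = urllib.request.Request(object_url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return int(response.headers["Content-Length"])
```

Under this limit, any per-language tarball above 5 GiB would need to be split, or served through a torrent created by other means.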
Closing this as improving dataset access is part of our 2020 work effort and infrastructure is currently being scoped. cc @johngian @phirework for visibility on the scrollback.
Is there somewhere I could follow to track the scoping? :)