Common-voice: Investigate cheaper options than s3 for hosting the data download

Created on 2 Feb 2018 · 20 Comments · Source: mozilla/common-voice

We are paying quite a bit of money right now for this.

Enhancement Investigate

mikehenrty

All 20 comments

Maybe you can use torrents?

GreenGearX on 3 Feb 2018

👍5

Good idea @GreenGearX! I have heard you can also create torrents that point at a web URL. We should probably create one of those regardless.
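A torrent that "points at a web URL" is usually called a web-seeded torrent: the metainfo file carries a `url-list` key (BEP 19) naming an HTTP location that clients can fall back to when no peers are available. The following is a minimal sketch of building such a file with only the standard library. The tracker URL, the web seed URL, and the function names are hypothetical placeholders, and this is not a description of how the project actually publishes its data.

```python
import hashlib
import os


def bencode(value):
    """Encode ints, str/bytes, lists and dicts using BitTorrent bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def piece_hashes(path, piece_length):
    """Return the concatenated SHA-1 digests of each fixed-size piece."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(piece_length)
            if not block:
                break
            digests.append(hashlib.sha1(block).digest())
    return b"".join(digests)


def make_metainfo(path, web_seed_url, tracker_url, piece_length=262144):
    """Build single-file torrent metainfo with an HTTP web seed.

    web_seed_url and tracker_url are assumptions supplied by the caller.
    """
    info = {
        "name": os.path.basename(path),
        "length": os.path.getsize(path),
        "piece length": piece_length,
        "pieces": piece_hashes(path, piece_length),
    }
    return bencode({
        "announce": tracker_url,
        "info": info,
        "url-list": [web_seed_url],  # BEP 19 web seed
    })
```

Writing the returned bytes to a `.torrent` file would give clients both a swarm and the existing HTTP download as a fallback source.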

We also want to move to the DAT protocol, but we were having some trouble due to our large number of files (1 million+). See: https://github.com/datproject/dat/issues/915

In the short term, we might pay for a cheaper solution than S3. Thanks again for the suggestion @GreenGearX! That's why I created this issue.

mikehenrty on 3 Feb 2018

@mikehenrty Is it possible to split it up into chunks? Right now it's an all-or-nothing download, which has a few drawbacks. For example, as the file size increases, an interruption in the connection becomes more likely, which means having to redownload everything and waste more bandwidth. Splitting it up into chunks also makes it easier to validate parts of the recordings without having to commit to a full download.
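To make the chunking idea concrete, here is a minimal sketch that splits one large file into fixed-size parts and records a SHA-256 checksum per part, so that a failed or corrupted part can be re-fetched on its own. The part naming scheme and the function names are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def split_with_manifest(src, out_dir, chunk_size):
    """Split src into numbered parts; return [(part_name, sha256_hex), ...].

    Each part is read fully into memory, so chunk_size should stay modest.
    A production version would stream within each part instead.
    """
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    with src.open("rb") as f:
        index = 0
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            # Hypothetical naming scheme: <original name>.part0000, ...
            part = out_dir / f"{src.name}.part{index:04d}"
            part.write_bytes(block)
            manifest.append((part.name, hashlib.sha256(block).hexdigest()))
            index += 1
    return manifest


def verify_part(path, expected_sha256):
    """Return True if the part on disk matches its manifest checksum."""
    actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return actual == expected_sha256
```

A downloader could fetch the manifest first, then fetch and verify parts one at a time, retrying only the parts that fail verification.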

syilmaz on 5 Feb 2018

DAT (mentioned above) gives the ability to pause and resume syncing, which is useful if the connection is spotty.

As for breaking the download into chunks, the only useful split I can think of is validated vs. unvalidated. This would roughly halve the repo. Anything beyond that might actually make it more annoying for people to download.

mikehenrty on 5 Feb 2018

I agree, splitting up too much could become a nuisance. Splitting between validated and unvalidated coupled with DAT seems like a great option.

syilmaz on 5 Feb 2018

I also think torrents would be a good idea. You could release updated torrents on an RSS feed and people who wanted to help host would simply subscribe and make it available to seed. I for one would be happy to!

adamaze on 8 Feb 2018

+1 for torrents. But I would keep S3 as a backup option.

vasimi on 8 Feb 2018

Hmmm, I just came across this which looks pretty promising:
https://docs.aws.amazon.com/AmazonS3/latest/dev/S3TorrentRetrieve.html

mikehenrty on 8 Feb 2018

👍1

This looks good. And it would work for the other datasets hosted on S3, too.

vasimi on 8 Feb 2018

Hello! I've put the tarball up on Dat for now. Users can download via Dat to help offset bandwidth costs a bit! Once we fix that bug, it'll be great to be able to download subsets of files over Dat.

That S3 -> Torrent thing is cool! Creating a better S3 -> Dat workflow will definitely help make usage easier.

joehand on 14 Feb 2018

❤2

Not entirely sure, but maybe archive.org? They create a torrent too while supplying free downloads.

ghost on 21 Apr 2018

Any updates on making a torrent link available on the webpage?

Bullnados on 8 Mar 2019

> Not entirely sure, but maybe archive.org? They create a torrent too while supplying free downloads.

@missuniverse I did some digging, and it turns out archive.org items cannot have more than 100,000 files. It would have been great if they could, since archive.org provides a command line tool that we could use to automate the flow.

Edit: there are 677,021 mp3 files in the en collection

f0cus10 on 31 Mar 2019

@f0cus10 Then it's just a matter of how we package it. If the files are in an archive (tar, zip, rar, etc.), then we are good. Also, to my knowledge, archive.org partners with organizations to archive their stuff, which I think might be cheaper than S3. They helped NASA JPL archive their public images.
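As a rough sketch of the packaging idea: grouping the clips into tar archives means the hosted item holds a handful of archives rather than hundreds of thousands of loose files. With the 677,021 clips mentioned above and a hypothetical 10,000 clips per archive, that would be 68 archives. The archive naming, the per-archive count, and the `*.mp3` pattern are assumptions for illustration.

```python
import math
import tarfile
from pathlib import Path


def shard_count(n_files, per_archive):
    """Number of archives needed to hold n_files at per_archive files each."""
    return math.ceil(n_files / per_archive)


def pack_shards(src_dir, out_dir, per_archive, pattern="*.mp3"):
    """Pack matching files from src_dir into numbered, uncompressed tar shards.

    out_dir should live outside src_dir so shards are never picked up as input.
    """
    files = sorted(Path(src_dir).glob(pattern))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for start in range(0, len(files), per_archive):
        # Hypothetical naming scheme: clips-00000.tar, clips-00001.tar, ...
        archive = out_dir / f"clips-{start // per_archive:05d}.tar"
        with tarfile.open(str(archive), "w") as tar:
            for f in files[start:start + per_archive]:
                tar.add(str(f), arcname=f.name)
        archives.append(archive)
    return archives
```

MP3 audio is already compressed, so plain `.tar` shards avoid spending time on gzip for little gain.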

ghost on 31 Mar 2019

Yeah, I was thinking that might be a way to bypass those limitations. I'll keep on digging in that area.

f0cus10 on 1 Apr 2019

Hello!

I'm working for the Dat Foundation and we just wanted to reach out to see if you were still interested in exploring different ways to transfer your data. We're just finishing up a release that'll enable Dat to handle larger datasets and have a FUSE mount so you'll be able to store files in a folder on your actual filesystem and have them automatically sync via Dat, or to be able to sparsely download mounted datasets as you're reading from files in this filesystem.

Does that sound like something you folks would be interested in discussing further?

Kind Regards,

Mauve

RangerMauve on 11 Nov 2019

Hi Mauve,

This is something that we are looking at and is on the roadmap but we are not ready to implement currently. We will reach out in the future.

LRSaunders4 on 14 Nov 2019

🚀1 ❤1

> Hmmm, I just came across this which looks pretty promising:
> https://docs.aws.amazon.com/AmazonS3/latest/dev/S3TorrentRetrieve.html

Hey, I tried to apply this method to download the English dataset, which is now ~30 GB. I got the following error, so maybe this method won't work anymore.
GET https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-3/en.tar.gz?torrent
<?xml version="1.0" encoding="UTF-8"?> <Error><Code>InvalidRequest</Code><Message>Torrent creation is not supported for objects larger than 5368709120</Message><RequestId>0445D1E43E735274</RequestId><HostId>Ndv7t13hIWz/3wPHWLegJnukRaEJq01YH5DLhxpJ8JEXxOSPpiQqRdCX+bNkY10YTMbCm1N9UIY=</HostId></Error>
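The limit in that error message, 5368709120 bytes, is exactly 5 GiB. A small sketch of the arithmetic, assuming the archive is about 30 × 10^9 bytes (the "~30 GB" figure above is approximate), shows that it would have to be split into at least six objects before each one fell under the limit. The function names are hypothetical.

```python
import math

# Limit quoted verbatim in the S3 error message above (5 GiB).
S3_TORRENT_MAX_BYTES = 5368709120


def torrent_supported(object_size):
    """True if an object of this size is within the quoted torrent limit."""
    return object_size <= S3_TORRENT_MAX_BYTES


def parts_needed(object_size, limit=S3_TORRENT_MAX_BYTES):
    """Minimum number of parts so that each part fits within the limit."""
    return math.ceil(object_size / limit)


# Assumption: "~30 GB" taken as 30 * 10**9 bytes.
approx_size = 30 * 10**9
print(torrent_supported(approx_size))  # False
print(parts_needed(approx_size))       # 6
```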

narendrapetkar on 19 Nov 2019

Closing this as improving dataset access is part of our 2020 work effort and infrastructure is currently being scoped. cc @johngian @phirework for visibility on the scrollback.

mbransn on 10 Mar 2020

Is there somewhere I could follow to track the scoping? :)

RangerMauve on 10 Mar 2020
