It was agreed that the solution for hosting segmentation models should be GitHub.
This issue requests the creation of an organization (similar to sct-data) that will contain a repository for each model of interest.
The main concern with the repositories approach is that the models are currently large enough to cause issues during git push operations.
So it might be a good idea to also create some sort of _model deployment helper_ to accompany the repositories and make the deployment of models a more seamless process.
If we follow this route, the helper should include a solution to the large files problem (maybe based on git lfs) and maybe other interesting features suggested by the team that handles the creation/training of models.
I don't have a problem with LFS except that it raises the bar for hosting, and this is really just another github.com limitation meant to monetize their platform. I'm also wondering what happens when attempting to fork a paying customer's public LFS-enabled repo (@Drulex ?).
100 MB @ 4B/parameter is 25M parameters, which is not that huge for segmentation networks like U-nets (see for example ONNX model zoo section for detection/segmentation https://github.com/onnx/models#object_detection).
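To make that arithmetic explicit, here is a minimal sketch. It assumes decimal megabytes and counts weights only (no file-format overhead); the function name is made up for illustration.

```python
def max_parameters(size_limit_bytes: int, bytes_per_parameter: int = 4) -> int:
    """Number of parameters whose raw weights fit within the size limit."""
    return size_limit_bytes // bytes_per_parameter


# 100 MB per-file limit, taken as decimal megabytes as in the estimate above.
GITHUB_FILE_LIMIT = 100 * 10**6

print(max_parameters(GITHUB_FILE_LIMIT))     # float32 weights (4 bytes each)
print(max_parameters(GITHUB_FILE_LIMIT, 2))  # float16 weights (2 bytes each)
```

Storing weights in half precision would double the parameter budget under the same limit, at whatever accuracy cost that implies for a given model.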
Considering that the main consumers of the data are scripts, which are fetching release artifacts, and that release files are not limited as much (2 GB), it could be "portable" to have a small procedure for managing large files while staying in standard git repos.
For example, the repo could store split files and have some boilerplate build script that can split/merge files, so as to perform the release; the contributor would then have to pack/unpack the files as part of the development workflow. Since the data is opaque, it's not completely crazy to do that kind of thing...
I'm not really in favour of primarily hosting files on e.g. OSF, because the tracking is not as good as what a git repository offers and it's also a form of lock-in.
i am also questioning why our recent pytorch models are "big", while our previous tensorflow models, which already worked really well, are ~5MB each.
What models should be considered in this issue?
I'm aware of the ones contained in the models directory in the OSF project (which are linked in spinalcordtoolbox/deepseg/models.py), but are there other models somewhere else?
Also, the sct-data contains some deepseg models for the moment, are we moving those at some point?
> What models should be considered in this issue?
> I'm aware of the ones contained in the models directory in the OSF project (which are linked in spinalcordtoolbox/deepseg/models.py), but are there other models somewhere else?
additional models are cooking (work of @charleygros @andreanne-lemay @lrouhier @olix86), it is difficult to predict at this point how many models and how large each will be.
> Also, the sct-data contains some deepseg models for the moment, are we moving those at some point?
i guess it depends on what solution we settle on
Assuming most models (>90%) will be <100MB, is it reasonable to go with GitHub, and for the remaining 10% (which might not be for production, but only for specific collaborations) go with OSF?
Summary:
split/merge lousy example (yeah, the principle is clunky, but it has the merit of working):
# Template repo
cJ@pouet ~/splitmerge> ls
Makefile README
cJ@pouet ~/splitmerge> cat README
Usage
#####
Because of a github limitation, the files in the repository have to be smaller
than 100 MB. Also, we don't want to pay for git LFS.
Before committing::
make split
After checkout::
make merge
cJ@pouet ~/splitmerge> cat Makefile
.PHONY: merge split

merge:
	find pouet-* | sort | xargs cat > pouet
	rm -f pouet-*

split:
	rm -f pouet-*
	split --bytes=1000000 -d pouet pouet-
	rm -f pouet
# Add "model"
cJ@pouet ~/splitmerge> dd if=/dev/random bs=1M count=10 of=pouet
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.078636 s, 133 MB/s
cJ@pouet ~/splitmerge> ls
Makefile README pouet
# Prepare for commit
cJ@pouet ~/splitmerge> make split
rm -f pouet-*
split --bytes=1000000 -d pouet pouet-
rm -f pouet
cJ@pouet ~/splitmerge> git add pouet-*
cJ@pouet ~/splitmerge> git commit -m "pouet"
[master (root-commit) 53cf112] pouet
12 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 README
create mode 100644 pouet-00
create mode 100644 pouet-01
create mode 100644 pouet-02
create mode 100644 pouet-03
create mode 100644 pouet-04
create mode 100644 pouet-05
create mode 100644 pouet-06
create mode 100644 pouet-07
create mode 100644 pouet-08
create mode 100644 pouet-09
create mode 100644 pouet-10
cJ@pouet ~/splitmerge(master)> ls
Makefile README pouet-00 pouet-01 pouet-02 pouet-03 pouet-04 pouet-05 pouet-06 pouet-07 pouet-08 pouet-09 pouet-10
# After checkout
cJ@pouet ~/splitmerge(master)> make merge
find pouet-* | sort | xargs cat > pouet
rm -f pouet-*
cJ@pouet ~/splitmerge(master)> ls
Makefile README pouet
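The same split/merge idea could also live in a Python helper rather than a Makefile. This is a minimal sketch, assuming fixed-size chunks and a checksum to confirm the round trip. The function names and the four-digit suffix are made up for illustration and are not part of any existing tool.

```python
from __future__ import annotations

import hashlib
import os
import tempfile
from pathlib import Path

CHUNK_SIZE = 1_000_000  # same chunk size as the Makefile above


def sha256(path: Path) -> str:
    """Checksum used to confirm that merge(split(file)) == file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def split_file(path: Path, chunk_size: int = CHUNK_SIZE) -> list[Path]:
    """Split `path` into numbered parts (name-0000, name-0001, ...) and remove it."""
    parts = []
    with path.open("rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part = path.with_name(f"{path.name}-{index:04d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    path.unlink()
    return parts


def merge_file(path: Path) -> Path:
    """Rebuild `path` from its numbered parts, then remove the parts."""
    parts = sorted(path.parent.glob(f"{path.name}-[0-9][0-9][0-9][0-9]"))
    with path.open("wb") as dst:
        for part in parts:
            dst.write(part.read_bytes())
    for part in parts:
        part.unlink()
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp) / "pouet"
        model.write_bytes(os.urandom(2_500_000))  # stand-in for a model file
        before = sha256(model)
        print(len(split_file(model)), "parts written")
        merge_file(model)
        print("round trip ok:", sha256(model) == before)
```

Zero-padded suffixes keep lexical and numeric order identical, which is what lets `sorted()` reassemble the parts in the right order.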
~here → https://github.com/ivadomed-data~
UPDATED 2020-06-22 12:02:49: now using: https://github.com/ivadomed
I'm not sure git-lfs was meant for lock-in. Its default algorithm for finding the filestore is Github-centric -- especially, it demands a client-server model, there's no way to do decentralized peer-to-peer git with it -- but they provide https://github.com/git-lfs/lfs-test-server which you can self-host (some expletive-filled install instructions for it here), or you can use Gitea.
I think git-lfs is the best option on the table.
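You can also choose which files are in or out of git-lfs individually, by file extension, or by folder (the patterns are stored in .gitattributes and use a syntax close to that of .gitignore). Quote the patterns so the shell does not expand them, and use `models/**` rather than `models/` so that the match is recursive. For example:
git lfs track "*.onnx"
git lfs track "models/**"
git lfs track helper.bin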
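For context, `git lfs track` records each pattern as a line in `.gitattributes`. The sketch below only prints the lines one would expect for the three patterns above, with the directory written as `models/**` (the form `.gitattributes` needs for a recursive match). The helper name is made up for illustration.

```python
# Attributes that git-lfs attaches to every tracked pattern.
LFS_ATTRIBUTES = "filter=lfs diff=lfs merge=lfs -text"


def lfs_attribute_line(pattern: str) -> str:
    """Build the .gitattributes line corresponding to one tracked pattern."""
    return f"{pattern} {LFS_ATTRIBUTES}"


for pattern in ("*.onnx", "models/**", "helper.bin"):
    print(lfs_attribute_line(pattern))
```

Since `.gitattributes` is a regular tracked file, the choice of what goes to LFS is versioned along with the repository.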
A flexible option would be git-annex. It also works with regular git repos and stores the large content elsewhere off to the side, enabling lazy checkouts. It even, get this, can write to git-lfs -- though probably with a different folder structure than straight git-lfs would make. We could make all the models git-annex repos (or datalad repos; same thing, pretty much), configure them with a Github git-lfs mirror and an OSF mirror, and even configure our internal smb:// servers as mirrors too, so we can be less worried about lock-in.
But I would rather just use git-lfs straight. We could deploy Gitea on git.neuro.polymtl.ca to have internal mirrors.
thank you for your insights @kousu. So if we go with git-lfs, are you suggesting that we primarily host the data in a server at polytechnique? i'm a bit worried about maintenance (should i worry?). Or were you suggesting to only use git.neuro.polymtl.ca as additional mirrors? If the latter, where would the primary host be located? I'm definitely OK paying for a remote host, i'm just not sure what price ranges we are talking about: <$500/y would be ok for a solution with high BW, redundancy, accessible everywhere on the planet, not worrying about maximum BW or storage space. Is that reasonable?
My "vote", considering the context, would be for git-lfs on github.
I agree. git-lfs on github as the primary mirror with a low-priority plan to deploy an internal mirror.
We could also mirror to gitlab.com; they even have a github importer (I don't know if it imports the LFS too, but I would hope it does; if not, we can just write our own).
It's too bad that git-lfs is so client-server centric! The one really nice feature of git-annex is you can use any git repo as a large file store.
(sorry! I mis-clicked!)
Another nice feature of git-annex is `git config annex.largefiles 'largerthan=100kb'`. You can ask it to make the small model/big model decision for you (and of course you can always override it explicitly). Can we live with keeping all models in LFS?
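A size-based rule like that is also easy to approximate outside git-annex, for example to check which files in a model repo would need special handling. This is a minimal sketch, assuming "100kb" means 100,000 bytes and that anything under `.git/` should be ignored; the function name is made up for illustration.

```python
import tempfile
from pathlib import Path

LARGE_FILE_THRESHOLD = 100_000  # assumption: "100kb" read as 100,000 bytes


def files_larger_than(root: Path, threshold: int = LARGE_FILE_THRESHOLD) -> list:
    """Return paths (relative to `root`) of files bigger than `threshold` bytes."""
    large = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip git's own object store and anything that is not a regular file.
        if ".git" in relative.parts or not path.is_file():
            continue
        if path.stat().st_size > threshold:
            large.append(relative)
    return sorted(large)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "README").write_text("small file\n")
        (repo / "models").mkdir()
        (repo / "models" / "unet.onnx").write_bytes(b"\0" * 200_000)
        for path in files_larger_than(repo):
            print(path)  # candidates for LFS or annex
```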
We will probably have to pay Github for being our CDN. Cost is 1USD/10GB/month: https://help.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/about-billing-for-git-large-file-storage#purchasing-additional-storage-and-bandwidth.
Re @zougloub:
> I don't have a problem with LFS except that it's raising the bar for hosting, and this is really about another github.com limitation in order to monetize their platform. I'm also wondering what happens when attempting to fork a paying customer's public LFS-enabled repo (@Drulex ?).
from that page:
> Bandwidth and storage usage only count against the repository owner's quotas. In forks, bandwidth and storage usage count against the root of the repository network. Anyone with write access to a repository can push files to Git LFS without affecting their personal bandwidth and storage quotas or purchasing data packs. Forking and pulling a repository counts against the parent repository's bandwidth limit.
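i don't have enough knowledge to read between the lines here. Will the 1USD/10GB/month be affected by:
* storage (i assume so)
* users installing SCT? (hopefully not, because difficult to properly scale)
* CI? (hopefully not, for obvious reasons)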
also (related to the bullet list above), does this quota include both storage and BW?
I can't make up my mind. @zougloub is right, practically git-lfs means Github. It's so much simpler to use than git-annex or datalad, but it has some downsides.
But git-annex can plug into a huge range of storage services, including, as of this week, OSF.io; and I am prototyping the same thing for openneuro.org today. Those last two are funded by research money, and I think they have larger file size limits than Github's 2GB.
In another world, both OSF.io and openneuro.org would have deployed git-lfs servers and we wouldn't have to make a compromise.
> does this quota include both storage and BW?
The quota is really two quotas, one for storage and one for bandwidth: you can store as much per month as you can download/upload.
This is very unusual; bandwidth is usually sold in larger quantities than storage, because you're usually uploading something in order to broadcast it many times. e.g. on https://www.digitalocean.com/pricing/ the bandwidth/storage ratio ranges from 1TB/25GB ~= 40x down to 12TB/3.25TB ~= 3.7x (and in practice you don't usually serve your entire 25GB disk to clients; you'll keep 10GB for OS and webapps and maybe 1GB of actual content).
> Will the 1USD/10GB/month be affected by:
> * storage (i assume so)
Of course.
> * users installing SCT? (hopefully not, because difficult to properly scale)
Yes, definitely, that counts against the bandwidth quota. If we go with git-lfs then sct_download_data -d optic_model would become
git clone https://github.com/neuropoly/sct-optic-model.git optic_model
with git-lfs kicking in underneath.
> * CI? (hopefully not, for obvious reasons)
If CI is Travis, then yes, definitely. I don't know if Github would count against their own storage limits if we use Github Actions. We can do an experiment: make a small repo (with a 1GB limit), commit a 100MB file to it, run a CI script in travis that just, say, downloads and checksums the file, see how many times we can do that. Then make a different small repo and do the same with Github Actions.
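To get a feel for the numbers, the answers above can be turned into a back-of-the-envelope estimate. This is only a sketch under loud assumptions: the 1 USD per 10 GB per month figure quoted earlier in this thread, a 1 GB free allowance, billing in whole 10 GB steps, and one purchase raising the storage and bandwidth quotas together. None of this is taken from GitHub's actual billing rules, and the function names are made up.

```python
import math


def monthly_bandwidth_gb(model_size_mb: float, installs: int, ci_runs: int) -> float:
    """Bandwidth used if every install and every CI run downloads the model once."""
    return model_size_mb * (installs + ci_runs) / 1000


def monthly_lfs_cost_usd(storage_gb: float, bandwidth_gb: float,
                         free_gb: float = 1.0, usd_per_10gb: float = 1.0) -> float:
    """Rough monthly cost, assuming one purchase covers both quotas equally."""
    # Whichever of storage or bandwidth is larger decides how much to buy.
    needed_gb = max(storage_gb, bandwidth_gb) - free_gb
    if needed_gb <= 0:
        return 0.0
    return math.ceil(needed_gb / 10) * usd_per_10gb


# Hypothetical month: one 100 MB model, 500 installs, 100 CI runs, 2 GB stored.
bandwidth = monthly_bandwidth_gb(100, installs=500, ci_runs=100)
print(bandwidth, "GB of bandwidth")
print(monthly_lfs_cost_usd(storage_gb=2, bandwidth_gb=bandwidth), "USD per month")
```

Under these assumptions bandwidth, not storage, dominates the bill as soon as installs and CI both pull through LFS, which is why the question of what CI counts against matters.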
If we used git-annex (and/or datalad) then at install time the installer would do
git clone https://github.com/neuropoly/sct-optic-model.git optic_model
(cd optic_model; git annex get .);
or even
git clone https://github.com/neuropoly/sct-optic-model.git optic_model
(cd optic_model; git annex get --from=OSF .);
the difference here is that git-annex makes it easier to plug in alternate mirrors.
I have a wild idea for a side project. git-lfs allows configuring the LFS server per-remote (i.e. per mirror). But it only supports https://, which actually means this lfs+https:// protocol. If we patched git-lfs to support, say, osf:// or openneuro:// or dropbox:// URLs we could marry the best features of both.
given that 95% of our compiled binaries / MRI+histo data / DL models are <100MB (per file), maybe we can go ahead with the git/GitHub approach (as originally planned), with one repo per dataset, and see if we are happy with that approach? then for the 5% remaining we could implement @zougloub's suggestion (https://github.com/neuropoly/spinalcordtoolbox/issues/2746#issuecomment-646667197)?
Note that sct_download_data & friends wouldn't get data using a git clone but rather a download from the release tarballs (made using @saulo-p's WIP release script), which aren't metered in the same way.
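For illustration, fetching a model from a release rather than through `git clone` could look roughly like the sketch below. It relies on GitHub's standard release asset URL layout; the tag and asset names are hypothetical, the repository name is the example one used earlier in this thread, and the checksum step is an added precaution rather than something sct_download_data is known to do.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def release_asset_url(owner: str, repo: str, tag: str, asset: str) -> str:
    """URL of a file attached to a GitHub release."""
    return f"https://github.com/{owner}/{repo}/releases/download/{tag}/{asset}"


def sha256_of(path: Path) -> str:
    """Checksum of a file, read in 1 MiB blocks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def fetch_release_asset(url: str, dest: Path, expected_sha256: str) -> Path:
    """Download `url` to `dest`; delete the file and fail on checksum mismatch."""
    urllib.request.urlretrieve(url, str(dest))
    actual = sha256_of(dest)
    if actual != expected_sha256:
        dest.unlink()
        raise ValueError(f"checksum mismatch for {url}: got {actual}")
    return dest


if __name__ == "__main__":
    # Hypothetical tag and asset names, shown only to illustrate the URL shape.
    print(release_asset_url("neuropoly", "sct-optic-model", "r20200622", "optic_model.zip"))

    # Offline demo: a local file:// URL stands in for the release asset.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "optic_model.zip"
        source.write_bytes(b"not a real model")
        fetched = fetch_release_asset(source.as_uri(), Path(tmp) / "download.zip",
                                      sha256_of(source))
        print("fetched", fetched.stat().st_size, "bytes")
```

Publishing a checksum alongside each release would let the installer detect truncated or corrupted downloads whichever mirror ends up serving the file.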
That plus the fact that the quota is per-repo, I don't think we will hit bandwidth issues, and I would be tempted to say that it's rather safe to use github's git-lfs, for the main reason that the researcher workflow would be the simplest; git annex or the split+merge workaround also work on github, but they would be one more hoop to jump through for the developers, and I'm not sure how they would appreciate it.
There is one more option... we could also run these new repos on someone else's computer, for instance gitlab (which BTW also doesn't require creating new top-level organizations...), either on gitlab.com (https://about.gitlab.com/solutions/education/) or on a hosted gitlab instance; gitlab.com has repo limitations but they are friendlier and LFS is not required; a self-hosted instance has no limitation.
Since you said you would be ready to pay a little for this, I'm pretty sure you could get practically-unlimited storage+bandwidth from Polytechnique's IT or on a VPS from our friends at OVH (at the cost of requiring new developer registrations).