Spinalcordtoolbox: Host large segmentation models

Created on 17 Jun 2020 · 23 Comments · Source: spinalcordtoolbox/spinalcordtoolbox

It was agreed that the solution for hosting segmentation models should be GitHub.
This issue requests the creation of an organization (similar to sct-data) that will contain a repository for each model of interest.

The main concern with the repositories approach is that the models are currently large enough to cause issues during git push operations.
So it might be a good idea to also create some sort of _model deployment helper_ to accompany the repositories and make the deployment of models a more seamless process.
If we follow this route, the helper should include a solution to the large files problem (maybe based on git lfs) and maybe other interesting features suggested by the team that handles the creation/training of models.

installation needs-discussion

saulo-p

👍3

All 23 comments

I don't have a problem with LFS except that it's raising the bar for hosting, and this is really about another github.com limitation in order to monetize their platform. I'm also wondering what happens when attempting to fork a paying customer's public LFS-enabled repo (@Drulex ?).

100 MB @ 4B/parameter is 25M parameters, which is not that huge for segmentation networks like U-nets (see for example ONNX model zoo section for detection/segmentation https://github.com/onnx/models#object_detection).
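The arithmetic behind that figure can be checked with a short sketch. It assumes the file holds nothing but raw weights, that "100 MB" means decimal megabytes, and that `max_parameters` is just an illustrative helper name:

```python
def max_parameters(size_bytes: int, bytes_per_parameter: int = 4) -> int:
    """Number of weights that fit in a file of the given size.

    Assumes the file contains only raw weights (no metadata, no compression).
    """
    return size_bytes // bytes_per_parameter


# Assumption: the 100 MB limit is counted in decimal megabytes.
limit_bytes = 100 * 1000 * 1000

print(max_parameters(limit_bytes))      # float32 weights (4 bytes each)
print(max_parameters(limit_bytes, 2))   # float16 weights (2 bytes each)
```

Storing weights in half precision would roughly double the parameter budget under the same limit, at whatever accuracy cost that implies for a given model.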

Considering that the main consumers of the data are scripts, which are fetching release artifacts, and that release files are not limited as much (2 GB), it could be "portable" to have a small procedure for managing large files while staying in standard git repos.
For example, the repo could store split files along with a boilerplate build script that can split/merge them. The script would be used to perform the release, and contributors would have to pack/unpack the files as part of the development workflow. Since the data is opaque, it's not completely crazy to do that kind of thing...

I'm not really in favour of primarily hosting files on eg. OSF because the tracking is not as good as that available in a git repository and it's also a form of lock-in.

zougloub on 17 Jun 2020

i am also questioning why our recent pytorch models are "big", while our previous tensorflow models, which already worked really well, are ~5MB each.

jcohenadad on 17 Jun 2020

😕1

What models should be considered in this issue?
I'm aware of the ones contained in the models directory in the OSF project (which are linked in spinalcordtoolbox/deepseg/models.py), but are there other models somewhere else?

Also, the sct-data contains some deepseg models for the moment, are we moving those at some point?

saulo-p on 18 Jun 2020

> What models should be considered in this issue?
> I'm aware of the ones contained in the models directory in the OSF project (which are linked in spinalcordtoolbox/deepseg/models.py), but are there other models somewhere else?

additional models are cooking (work of @charleygros @andreanne-lemay @lrouhier @olix86), it is difficult to predict at this point how many models and how large each will be.

> Also, the sct-data contains some deepseg models for the moment, are we moving those at some point?

i guess it depends on what solution we settle on

jcohenadad on 18 Jun 2020

👍3

Relates to https://github.com/neuropoly/ivadomed/issues/253

zougloub on 19 Jun 2020

Assuming most models (>90%) will be <100MB, is it reasonable to go with Github, and for the remaining 10% (which might not be for production, but only for specific collaborations) go with OSF?

jcohenadad on 19 Jun 2020

Summary:

- One repository for each model, no bundling of unrelated models
- Small model -> standard repo
- Big model -> TBDiscussed
  - OSF hosting doesn't have the size limitation. It could be used for (big) models which are more like single-file repos and don't really have a lot of churn (just a few revisions). We'd need a CDN on top of that if OSF is the only source, and the development workflow is non-standard.
  - LFS is the simplest to use and is seamlessly integrated with github, but it reduces portability and locks in to github. You just need to pay and you get the enterprise solution.
  - git-annex & others give storage infrastructure freedom, but they require a separate storage infrastructure and integration with github is not seamless.
  - script-assisted file merge/split doesn't need any new infrastructure, but changes the developer workflow to include split/merge for checkout/commit (we could have a template repo, I'll make a demo).
- The release script from @saulo-p is applicable to all cases (it creates release artifacts, and sends to OSF)
- @jcohenadad can create the organization for hosting the models
- @saulo-p can create the repos for all the models available on OSF
- Need to identify models that are too big among those that already exist (by pushing to a standard github repo or asking)
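For the last item, one way to spot the too-big models without pushing anything is to scan a local checkout for files above the limit. This is only a sketch: `find_oversized_files` is a hypothetical helper, and the exact threshold (decimal vs. binary megabytes) is an assumption exposed as a parameter.

```python
import os

# Assumption: GitHub's per-file limit, taken here as 100 decimal megabytes.
GITHUB_FILE_LIMIT = 100 * 1000 * 1000


def find_oversized_files(root, limit=GITHUB_FILE_LIMIT):
    """Return a sorted list of (path, size_in_bytes) for files larger than `limit`."""
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symlinks are stored as tiny blobs, skip them
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized)
```

Calling `find_oversized_files("path/to/models")` on a local copy of the OSF models directory would list the candidates for the "big model" route.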

zougloub on 19 Jun 2020

👍2

split/merge lousy example (yeah, the principle is clunky, but it has the merit of working):

# Template repo
cJ@pouet ~/splitmerge> ls
Makefile  README
cJ@pouet ~/splitmerge> cat README

Usage
#####

Because of a github limitation, the files in the repository have to be smaller
than 100 MB. Also, we don't want to pay for git LFS.

Before committing::

   make split

After checkout::

   make merge


cJ@pouet ~/splitmerge> cat Makefile
.PHONY: merge split

merge:
        find pouet-* | sort | xargs cat > pouet
        rm -f pouet-*

split:
        rm -f pouet-*
        split --bytes=1000000 -d pouet pouet-
        rm -f pouet

# Add "model"
cJ@pouet ~/splitmerge> dd if=/dev/random bs=1M count=10 of=pouet
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.078636 s, 133 MB/s

cJ@pouet ~/splitmerge> ls
Makefile  README  pouet

# Prepare for commit
cJ@pouet ~/splitmerge> make split
rm -f pouet-*
split --bytes=1000000 -d pouet pouet-
rm -f pouet
cJ@pouet ~/splitmerge> git add pouet-*
cJ@pouet ~/splitmerge> git commit -m "pouet"
[master (root-commit) 53cf112] pouet
 12 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 README
 create mode 100644 pouet-00
 create mode 100644 pouet-01
 create mode 100644 pouet-02
 create mode 100644 pouet-03
 create mode 100644 pouet-04
 create mode 100644 pouet-05
 create mode 100644 pouet-06
 create mode 100644 pouet-07
 create mode 100644 pouet-08
 create mode 100644 pouet-09
 create mode 100644 pouet-10
cJ@pouet ~/splitmerge(master)> ls
Makefile  README  pouet-00  pouet-01  pouet-02  pouet-03  pouet-04  pouet-05  pouet-06  pouet-07  pouet-08  pouet-09  pouet-10

# After checkout
cJ@pouet ~/splitmerge(master)> make merge
find pouet-* | sort | xargs cat > pouet
rm -f pouet-*
cJ@pouet ~/splitmerge(master)> ls
Makefile  README  pouet
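
For contributors without make or GNU split (Windows, for instance), the same two steps could be sketched in Python. The function names, the four-digit suffix and the 1 MB default chunk size are illustrative choices, not part of the template above; an empty input file is not handled.

```python
import glob
from pathlib import Path


def split_file(path, chunk_size=1_000_000):
    """Split `path` into numbered chunks next to it, then remove the original."""
    path = Path(path)
    parts = []
    with path.open("rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Zero-padded suffix so that lexicographic order equals chunk order.
            part = path.with_name(f"{path.name}-{index:04d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    path.unlink()
    return parts


def merge_file(path):
    """Rebuild `path` from its numbered chunks, then remove the chunks."""
    path = Path(path)
    pattern = glob.escape(path.name) + "-[0-9][0-9][0-9][0-9]"
    parts = sorted(path.parent.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no chunks found for {path}")
    with path.open("wb") as dst:
        for part in parts:
            dst.write(part.read_bytes())
    for part in parts:
        part.unlink()
    return path
```

`split_file("pouet")` would be run before committing and `merge_file("pouet")` after checkout, mirroring `make split` and `make merge`.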

zougloub on 19 Jun 2020

🚀1

~here → https://github.com/ivadomed-data~
UPDATED 2020-06-22 12:02:49: now using: https://github.com/ivadomed

jcohenadad on 19 Jun 2020

I'm not sure git-lfs was meant for lock-in. Its default algorithm for finding the filestore is Github-centric -- especially, it demands a client-server model, there's no way to do decentralized peer-to-peer git with it -- but they provide https://github.com/git-lfs/lfs-test-server which you can self-host (some expletive-filled install instructions for it here), or you can use Gitea.

I think git-lfs is the best option on the table.

You can also choose which files are in or out with git-lfs individually or by file extension or by folder (it uses the same syntax as .gitignore). For example:

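# quote glob patterns so the shell does not expand them; use dir/** to match a folder recursively
git lfs track "*.onnx"
git lfs track "models/**"
git lfs track helper.bin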

kousu on 22 Jun 2020

A flexible option would be git-annex. It also works with regular git repos and stores the large content elsewhere off to the side, enabling lazy checkouts. It even, get this, can write to git-lfs -- though probably with different folder structure than straight git-lfs would make. We could make all the models git-annex repos (or datalad repos; same thing, pretty much), configure them with a Github git-lfs mirror and an OSF mirror, and even configure our internal smb:// servers as mirrors too, so we can be less worried about lock-in.

But I would rather just use git-lfs straight. We could deploy Gitea on git.neuro.polymtl.ca to have internal mirrors.

kousu on 22 Jun 2020

thank you for your insights @kousu. So if we go with git-lfs, are you suggesting that we primarily host the data in a server at polytechnique? i'm a bit worried about maintenance (should i worry?). Or were you suggesting to only use git.neuro.polymtl.ca as additional mirrors? If the latter, where would the primary host be located? I'm definitely OK paying for a remote host, i'm just not sure what price ranges we are talking about: <$500/y would be ok for a solution with high BW, redundancy, accessible everywhere on the planet, not worrying about maximum BW or storage space. Is that reasonable?

jcohenadad on 22 Jun 2020

related to:

jcohenadad on 22 Jun 2020

My "vote", considering the context, would be for git-lfs on github.

zougloub on 22 Jun 2020

I agree. git-lfs on github as the primary mirror with a low-priority plan to deploy an internal mirror.

We could also mirror to gitlab.com; they even have a github importer (I don't know if it imports the LFS too, but I would hope it does; if not, we can just write our own).

It's too bad that git-lfs is so client-server centric! The one really nice feature of git-annex is you can use any git repo as a large file store.

kousu on 22 Jun 2020

(sorry! I mis-clicked!)

Another nice feature of git-annex is git config annex.largefiles 'largerthan=100kb'. You can ask it to make the small model/big model decision for you (and of course you can always override it explicitly). Can we live with keeping all models in LFS?

kousu on 22 Jun 2020

We will probably have to pay Github for being our CDN. Cost is 1USD/10GB/month: https://help.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/about-billing-for-git-large-file-storage#purchasing-additional-storage-and-bandwidth.

Re @zougloub:

> I don't have a problem with LFS except that it's raising the bar for hosting, and this is really about another github.com limitation in order to monetize their platform. I'm also wondering what happens when attempting to fork a paying customer's public LFS-enabled repo (@Drulex ?).

from that page:

> Bandwidth and storage usage only count against the repository owner's quotas. In forks, bandwidth and storage usage count against the root of the repository network. Anyone with write access to a repository can push files to Git LFS without affecting their personal bandwidth and storage quotas or purchasing data packs. Forking and pulling a repository counts against the parent repository's bandwidth limit.

kousu on 22 Jun 2020

> Bandwidth and storage usage only count against the repository owner's quotas. In forks, bandwidth and storage usage count against the root of the repository network. Anyone with write access to a repository can push files to Git LFS without affecting their personal bandwidth and storage quotas or purchasing data packs. Forking and pulling a repository counts against the parent repository's bandwidth limit.

i don't have enough knowledge to read between the lines here. Will the 1USD/10GB/month be affected by:

* storage (i assume so)
* users installing SCT? (hopefully not, because difficult to properly scale)
* CI? (hopefully not, for obvious reasons)

also (related to the bullet list above), does this quota include both storage and BW?

jcohenadad on 22 Jun 2020

I can't make up my mind. @zougloub is right, practically git-lfs means Github. It's so much simpler to use than git-annex or datalad, but it has some downsides.

But git-annex can plug into a huge range of storage services, including, as of this week, OSF.io; and I am prototyping the same thing for openneuro.org today. Those last two are funded by research money, and I think they have larger file size limits than Github's 2GB.

In another world, both OSF.io and openneuro.org would have deployed git-lfs servers and we wouldn't have to make a compromise.

kousu on 22 Jun 2020

> does this quota include both storage and BW?

The quota is really two quotas, one for storage and one for bandwidth: you can store as much per month as you can download/upload.

This is very unusual; bandwidth is usually sold larger than storage, because you're usually uploading something in order to broadcast it many times. e.g. on https://www.digitalocean.com/pricing/ the bandwidth/storage ratio ranges from 1TB/25GB ~= 40x down to 12TB/3.25TB ~= 3.5x (and in practice you don't usually serve your entire 25GB disk to clients; you'll keep 10GB for OS and webapps and maybe 1GB of actual content).
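
To get a feel for what this could cost, here is a rough estimate using the 1USD/10GB/month rate quoted above. It assumes each purchased block adds the same amount to both quotas, that blocks come in 10 GB increments (the real pack size may differ), and it ignores any free allowance; `monthly_lfs_cost_usd` is a made-up helper name.

```python
import math


def monthly_lfs_cost_usd(storage_gb, bandwidth_gb, pack_gb=10.0, pack_usd=1.0):
    """Estimate the monthly bill when one pack covers both storage and bandwidth.

    Since each pack raises both quotas by `pack_gb`, the larger of the two
    needs determines how many packs are required.
    """
    needed_gb = max(storage_gb, bandwidth_gb)
    packs = math.ceil(needed_gb / pack_gb)
    return packs * pack_usd


# Hypothetical scenario: 1 GB of models stored, 200 GB downloaded per month.
print(monthly_lfs_cost_usd(storage_gb=1, bandwidth_gb=200))
```

With numbers like these, bandwidth rather than storage is what drives the bill, which is why the questions about installs and CI matter.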

> Will the 1USD/10GB/month be affected by:
> * storage (i assume so)

Of course.

> * users installing SCT? (hopefully not, because difficult to properly scale)

Yes, definitely, that counts against the bandwidth quota. If we go with git-lfs then sct_download_data -d optic_model would become

git clone https://github.com/neuropoly/sct-optic-model.git optic_model

with git-lfs kicking in underneath.

> * CI? (hopefully not, for obvious reasons)

If CI is Travis, then yes, definitely. I don't know if Github would count against their own storage limits if we use Github Actions. We can do an experiment: make a small repo (with a 1GB limit), commit a 100MB file to it, run a CI script in travis that just, say, downloads and checksums the file, see how many times we can do that. Then make a different small repo and do the same with Github Actions.
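
The CI script for that experiment could be as small as the following sketch, which streams a URL and hashes it without keeping the file. `sha256_of_url` is a hypothetical name, and the URL to test against is whatever file the experiment repo ends up serving.

```python
import hashlib
import urllib.request


def sha256_of_url(url, chunk_size=1 << 20):
    """Download `url` in chunks and return the SHA-256 hex digest of its content."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Each CI run would call this once on the 100MB file and compare the digest with a known value; counting successful runs before downloads start failing would show how the quota is metered.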

If we used git-annex (and/or datalad) then at install time the installer would do

git clone https://github.com/neuropoly/sct-optic-model.git optic_model
(cd optic_model; git annex get .);

or even

git clone https://github.com/neuropoly/sct-optic-model.git optic_model
(cd optic_model; git annex get --from=OSF .);

the difference here is that git-annex makes it easier to plug in alternate mirrors.

kousu on 22 Jun 2020

I have a wild idea for a side project. git-lfs allows configuring the LFS server per-remote (i.e. per mirror). But it only supports https://, which actually means this lfs+https:// protocol. If we patched git-lfs to support, say, osf:// or openneuro:// or dropbox:// URLs we could marry the best features of both.

kousu on 22 Jun 2020

given that 95% of our compiled binaries / MRI+histo data / DL models are <100MB (per file), maybe we can go ahead with the git/GitHub approach (as originally planned), with one repo per dataset, and see if we are happy with that approach? then for the 5% remaining we could implement @zougloub's suggestion (https://github.com/neuropoly/spinalcordtoolbox/issues/2746#issuecomment-646667197)?

jcohenadad on 23 Jun 2020

Note that sct_download_data & friends wouldn't get data using a git clone but rather a download from the release tarballs (made using @saulo-p's WIP release script), which aren't metered in the same way.
That, plus the fact that the quota is per-repo, makes me think we will not hit bandwidth issues, and I would be tempted to say that it's rather safe to use github's git-lfs, for the main reason that the researcher workflow would be the simplest. git annex or the split+merge workaround would also work on github, but they would be one more hoop to jump through for the developers, and I'm not sure how they would appreciate it.
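
As a rough illustration of the release-tarball route, the downloader could look something like this. It is a sketch, not the actual sct_download_data code: the function names, the `.tar.gz` asset format and the repo/tag/file names in the usage note are assumptions. Only the `releases/download/<tag>/<file>` URL layout is GitHub's standard one.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def release_asset_url(org, repo, tag, filename):
    """Build the download URL of a GitHub release asset."""
    return f"https://github.com/{org}/{repo}/releases/download/{tag}/{filename}"


def fetch_and_extract(url, dest):
    """Download a tarball from `url` and unpack it into the `dest` directory."""
    dest = Path(dest).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "asset.tar.gz"
        with urllib.request.urlopen(url) as response, archive.open("wb") as out:
            shutil.copyfileobj(response, out)
        with tarfile.open(archive) as tar:
            # Basic safety check: refuse members that would land outside dest.
            for member in tar.getmembers():
                target = (dest / member.name).resolve()
                if target != dest and dest not in target.parents:
                    raise ValueError(f"unsafe path in archive: {member.name}")
            tar.extractall(dest)
    return dest
```

A call such as `fetch_and_extract(release_asset_url("ivadomed", "some-model", "r20200623", "some-model.tar.gz"), "optic_model")` (all names hypothetical) would then replace the `git clone`, so no git or git-lfs client is needed on the user's machine.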

There is one more option... we could also run these new repos on someone else's computer, for instance gitlab (which BTW also doesn't require creating new top-level organizations...), either on gitlab.com (https://about.gitlab.com/solutions/education/) or on a hosted gitlab instance; gitlab.com has repo limitations but they are friendlier and LFS is not required; self-hosted has no limitation.
Since you said you would be ready to pay a little for this, I'm pretty sure you could get practically-unlimited storage+bandwidth in Polytechnique's IT or on a VPS from our friends at OVH (at the cost of requiring new developer registrations).

zougloub on 23 Jun 2020
