Binderhub: Allow use of Git Large File Storage

Created on 4 Dec 2017 · 9 Comments · Source: jupyterhub/binderhub

Git is an imperfect tool for handling large files, and so Git Large File Storage (GLFS), an extension to Git, was developed. GLFS stores a small text file with a pointer to a larger file stored elsewhere, easing the burden on Git. A Git client with the GLFS extension installed will automatically download the large files when a repository is cloned.

Binder does not have GLFS installed. Thus when you clone a repository that uses GLFS, you get pointers to the data, not the data.
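To make the symptom concrete, here is a minimal sketch of how one might check whether a checked-out file is a GLFS pointer rather than the real data. It assumes the pointer layout from the Git LFS specification (a `version` line followed by `oid` and `size` lines). The function name `parse_lfs_pointer` and the sample values are illustrative only, not part of Binder or repo2docker.

```python
from typing import Optional

# First line of every pointer file, per the Git LFS pointer spec.
LFS_SPEC_PREFIX = "version https://git-lfs.github.com/spec/"


def parse_lfs_pointer(text: str) -> Optional[dict]:
    """Return the pointer's fields if `text` looks like a Git LFS pointer, else None."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(LFS_SPEC_PREFIX):
        return None
    fields = {}
    for line in lines:
        # Each line is "<key> <value>", e.g. "size 12345".
        key, _, value = line.partition(" ")
        fields[key] = value
    if "oid" not in fields or "size" not in fields:
        return None
    return fields


# Illustrative pointer contents (placeholder oid and size).
example = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393\n"
    "size 12345\n"
)
print(parse_lfs_pointer(example))
print(parse_lfs_pointer("just an ordinary file"))
```

If a data file in a cloned repository parses as a pointer like this, the clone was made without GLFS and the large file was never downloaded.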

Here is an example repository:

https://github.com/suchow/binder-glfs

I propose that you install GLFS and use it when cloning the repository from GitHub.

If someone could point me towards the step where the repository is cloned, etc., I could take a stab at this and open a PR.

configuration documentation

suchow

👍1

All 9 comments

Thanks for opening this issue! The tool used to clone and build repositories is https://github.com/jupyter/repo2docker so we should move the discussion there.

betatim on 4 Dec 2017

(Moved to https://github.com/jupyter/repo2docker/issues/162.)

suchow on 4 Dec 2017

👍2 🎉1

@betatim @willingc I was able to run jupyter-repo2docker on a Git repo that uses GLFS without needing to make any modification — all I did was install GLFS and run jupyter-repo2docker. It worked :smiley:. I think that means that, in fact, this issue is better handled here? All we'd need to do is to install GLFS on the build machine, wherever that is, and then everything should work without further modification.
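As a rough sketch of what "install GLFS and clone" could look like on a build machine, the snippet below checks whether a `git-lfs` executable is on the PATH and assembles the Git commands one might run. The helper names `git_lfs_available` and `clone_commands` are hypothetical, and this is not how repo2docker itself is implemented. The commands are only printed, not executed.

```python
import shutil
from typing import List


def git_lfs_available() -> bool:
    """True if a `git-lfs` executable is on PATH (Git runs it for `git lfs ...`)."""
    return shutil.which("git-lfs") is not None


def clone_commands(repo_url: str, dest: str, with_lfs: bool) -> List[List[str]]:
    """Commands to clone `repo_url` into `dest`, optionally fetching LFS objects."""
    commands = [["git", "clone", repo_url, dest]]
    if with_lfs:
        # `git lfs install` registers the LFS filters for the current user;
        # `git lfs pull` fetches LFS objects for the checked-out commit.
        commands.insert(0, ["git", "lfs", "install"])
        commands.append(["git", "-C", dest, "lfs", "pull"])
    return commands


for cmd in clone_commands(
    "https://github.com/suchow/binder-glfs", "repo", git_lfs_available()
):
    print(" ".join(cmd))
```

With the extension installed and its filters registered, a plain `git clone` usually downloads the large files on its own. The explicit `git lfs pull` is included as a fallback.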

suchow on 5 Dec 2017

This would probably require we install git-lfs into the dockerfile inside repo2docker.

I'm not entirely opposed to doing it, but note that it's quite limited on GitHub. In particular, it limits your repo to 1 GB of bandwidth per month. Also, per https://help.github.com/articles/about-storage-and-bandwidth-usage/#bandwidth-quota:

"If you use more than 1 GB of bandwidth per month without purchasing a data pack, Git LFS support is disabled on your account until the next month."

which means we'll probably get throttled on the public mybinder.org instance pretty fast :) Given that, maybe we shouldn't install it, so that we have a consistent experience (nobody gets LFS!) rather than "the first person in a month to clone an LFS repo gets it, everyone else does not!".

It should definitely be an option for private installs though!
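A back-of-the-envelope sketch of the quota concern: assuming the 1 GB figure quoted above is counted as 2**30 bytes (an assumption; GitHub may count it differently) and that every fresh clone downloads all of a repository's LFS objects, the number of clones that fit in a month is small. The function name and the 200 MiB repository size are hypothetical.

```python
QUOTA_BYTES = 1 * 1024**3  # the 1 GB monthly bandwidth figure, assumed to be 2**30 bytes


def builds_within_quota(lfs_bytes_per_clone: int, quota_bytes: int = QUOTA_BYTES) -> int:
    """Number of full clones that fit within the monthly bandwidth quota."""
    if lfs_bytes_per_clone <= 0:
        raise ValueError("lfs_bytes_per_clone must be positive")
    return quota_bytes // lfs_bytes_per_clone


# Hypothetical repository carrying 200 MiB of LFS objects.
print(builds_within_quota(200 * 1024**2))
```

Under these assumptions, only a handful of uncached builds of such a repository would fit before LFS is disabled for the month.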

yuvipanda on 5 Dec 2017

I've seen a number of folks storing notebooks in LFS (and I would recommend it!) because of the image content, so I think maybe we should support it by default? The quota usage is definitely something we should be thinking about. Since we only clone a given ref once to build the image (so long as we aren't invalidating our build cache too often), Binder ought to not be too big a consumer of quotas, right?

Do we do shallow clones in repo2docker?

minrk on 5 Dec 2017

"Do we do shallow clones in repo2docker?"

Not at the moment. I added it in https://github.com/jupyter/repo2docker/pull/120 but we reverted it in https://github.com/jupyter/repo2docker/pull/130. I couldn't work out a reliable way to detect whether we were in a shallow clone and then unshallow it. It should be easy (git has commands for it...) but somehow I didn't make it work reliably.
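For reference, one possible way to detect a shallow clone is to look for the `shallow` file that Git keeps inside the repository's `.git` directory, and then run `git fetch --unshallow` if it is present. The sketch below assumes an ordinary non-bare clone where `.git` is a directory. The helper names are hypothetical, and this is not the approach taken in the pull requests mentioned above.

```python
import os
import tempfile


def is_shallow_clone(repo_path: str) -> bool:
    """True if the repository at `repo_path` has a `.git/shallow` file."""
    return os.path.isfile(os.path.join(repo_path, ".git", "shallow"))


def unshallow_command(repo_path: str) -> list:
    """Command that would fetch the missing history for a shallow clone."""
    return ["git", "-C", repo_path, "fetch", "--unshallow"]


# Demonstrate on a throwaway directory that mimics a clone's layout.
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, ".git"))
    print(is_shallow_clone(repo))  # no .git/shallow file yet
    open(os.path.join(repo, ".git", "shallow"), "w").close()
    print(is_shallow_clone(repo))  # .git/shallow now exists
    print(" ".join(unshallow_command(repo)))
```

Newer Git versions also offer `git rev-parse --is-shallow-repository`, which avoids depending on the on-disk layout.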

betatim on 5 Dec 2017

Can anyone help me out with this? (New to coding)
I am trying to run a 1.7 GB trained model through Binder.
I am able to !wget the .bin file from GitHub LFS. But when I try to run it using the KeyedVector function on sklearn, the server crashes and restarts. Is it just because of the size, or am I making some obvious error?
https://github.com/prathamesh1993/Clinical-Acronym-disambiguation/blob/master/DM_5_class_pilot_acronym_to_longform.ipynb