Cdnjs: [Help wanted] Issue of too large repo, too many files, causing git process slow, need a new structure.

Created on 18 Oct 2014 · 35 Comments · Source: cdnjs/cdnjs

I was cloning the cdnjs repo, and looking at the following screenshot _(and disregarding the download speed :snail: )_, it seems to be packing **more than 1 GB** of content, even when cloning with `--depth 1`.

[Image: cdnjs]

As it stands, this repo is prohibitively large, and with each addition of another new library, it would become larger, making any new additions more expensive.

I'm new to cdnjs, so I'm not sure how it'll work out, or if it's feasible, but still...

I'd like to suggest some way to split this repo into two, one containing just the package.json files for the different libraries, and another which actually contains the libraries. Perhaps the new autoupdate feature would work seamlessly with this?
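To make the proposed split concrete, here is a minimal Python sketch that copies only the manifests into a separate tree. It assumes each library lives at `ajax/libs/<lib>/package.json`; the function name `copy_package_manifests` and the destination layout are hypothetical, not something the project defines.

```python
import shutil
from pathlib import Path
from typing import List


def copy_package_manifests(libs_root: Path, dest_root: Path) -> List[str]:
    """Copy <libs_root>/<lib>/package.json to <dest_root>/<lib>/package.json.

    Assumed layout (hypothetical): one directory per library, with the
    manifest at its top level. Library files themselves are not copied.
    Returns the sorted names of the libraries that had a manifest.
    """
    copied = []
    for lib_dir in sorted(p for p in libs_root.iterdir() if p.is_dir()):
        manifest = lib_dir / "package.json"
        if not manifest.is_file():
            continue  # no manifest for this library: skip it
        target_dir = dest_root / lib_dir.name
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(manifest, target_dir / "package.json")
        copied.append(lib_dir.name)
    return copied
```

The resulting tree holds one small file per library, which is the part contributors would need to clone and edit.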



Help wanted High Priority


All 35 comments

should tag @thomasdavis @ryankirkman @drewfreyling for comments

Probably break the project into 2 submodules?

For others facing the same issue, I did come upon this blog post by Atlassian.

After completing the shallow checkout with depth=4, it turns out that the /ajax/libs folder is a whopping 4.6 gigabytes in size.

As an aside, perhaps the node_modules folder could be kept out of the repo. The package.json file takes care of that anyway.

We keep node_modules because it saves time on CI builds, but CircleCI is fast enough (we switched to it a few weeks ago) and has a good cache mechanism, so I'll consider removing that folder. Thanks for your comments.

So we will most likely be moving towards replacing this repo with a repo which only contains package.json's.

Though I will add instructions for people to just clone with depth=3/4 now.

oh that's nice!
umm... any ETA on that? just asking, no pressure :)

BTW, the size of .git behind this repo is currently about 647MB.

Do you have an ETA on an alternative way to handle adding/updating libraries? This repo is close to imploding: the process is becoming so cumbersome that people will simply stop submitting their libs to it (and that's not only my impression), so I feel something is needed very fast.

No ETA yet. We don't have enough people or ideas to handle that. I tried git-lfs and found that its overhead is too high. We need help on this issue.

We will seemingly just have to manage a normal static directory on a master server somewhere. We can back it up with rsync every time we update it to emulate version control.
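One way to picture "rsync as version control" is a dated snapshot per update, with unchanged files hard-linked to the previous snapshot. The Python sketch below only builds the command line and runs nothing. The paths, the date-stamp naming and the helper name `rsync_snapshot_command` are assumptions for illustration; `-a`, `--delete` and `--link-dest` are standard rsync options.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def rsync_snapshot_command(
    source: PurePosixPath,
    backup_root: PurePosixPath,
    stamp: str,
    previous: Optional[str] = None,
) -> List[str]:
    """Build an rsync command that snapshots `source` into backup_root/stamp.

    If `previous` names an earlier snapshot, files unchanged since then are
    hard-linked instead of copied, so each snapshot costs only its changes.
    Use absolute paths: a relative --link-dest is resolved against the
    destination directory.
    """
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        cmd.append(f"--link-dest={backup_root / previous}")
    # Trailing slash on the source copies its contents, not the directory itself.
    cmd += [f"{source}/", f"{backup_root / stamp}/"]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on the master server after each update.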

I don't think the Git way is going to scale any further for this purpose (nor will GitHub). What I feel is needed is a small UI where people can submit their lib: a simple form that takes the https URL of the GitHub repo, and the package.json in the repo should do the rest.

Eventually, you could maintain that in a git repo that would contain txt files (one per lib) and the file would contain the URL of the git repo. But again, that's not what git was meant for.
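As a rough illustration of the "one txt file per lib" idea, this Python sketch loads such a directory into a mapping from library name to repo URL. The file naming (`<lib>.txt`) and the function name `load_registry` are assumptions, not an existing cdnjs convention.

```python
from pathlib import Path
from typing import Dict


def load_registry(registry_dir: Path) -> Dict[str, str]:
    """Map library name -> git repo URL from files named <lib>.txt.

    Each file is assumed (hypothetically) to contain just the URL.
    Empty files are ignored.
    """
    registry = {}
    for entry in sorted(registry_dir.glob("*.txt")):
        url = entry.read_text(encoding="utf-8").strip()
        if url:
            registry[entry.stem] = url
    return registry
```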

We're on the waiting list for Git Large File Storage for cdnjs so hopefully that sorts our storage problem out :)

/cc @PeterDaveHello @thomasdavis

@kumarharsh FYI you can use git sparse checkout to only work on the part of the repo you care about. It should help reduce the repo size.

@ryankirkman: Git large file storage is about handling large files, not a large number of files. It will not help one bit here. Git sparse checkout, while helping a bit, is not going to scale either.

What you are looking for here is a way to store a HUGE amount of small files. Git is not the answer for this. I've been working in related fields for more than 20 years now and I'll probably stop following this thread soon. I'll give you my last advice here.

A revision control system is JUST NOT THE RIGHT TOOL FOR THE JOB. STOP LOOKING AT WORKING AROUND GIT. It will not scale. It was never meant to. It will never scale the way you want it to, no matter what "plugins" or workarounds you find, simply because git is not designed for what you are trying to do. @PeterDaveHello, asking for help because git is too slow is not the right message. You don't need a way to make git faster; you need a system to host a large number of files and a system for individual users to update their files. This has nothing to do with git. Any workaround for git will at best buy you a few more months, but that is all. And you will be trapped in a system that doesn't work (at least not properly) until the death of the project.

What you guys need is a replicated file system (across all the mirrors you host). This takes nothing more than FTP and rsync. Simple tools exist; just find the proper way to combine them.

@ryankirkman git-lfs won't help; I already tried it. The waiting list is for access to git-lfs on GitHub, but it can be used with other implementations, which I did. The overhead is too high, it's not designed for small files, and it won't fix our problem.

@pieroxy thank you for your advice. I know that git-lfs won't help, and I'm still looking for a solution.

@pieroxy I agree with everything you said there. There is really no way the current system is going to scale at all.

BTW, I wrote up the steps for using sparse checkout + shallow clone (pull) as a workaround until we fix this issue.
One of the options will be a web interface where contributors update the files and libs, and then we do all the work on our server, but still commit as the contributor.

Honestly you already have a great tool (cdnjs-importer) that takes a git repo as an input and does the rest for you. Can't you just build a tiny website with an input "library git repo" and a submit button and automate the rest? I feel like submitting a pull request is incredibly overkill.

And by tiny website I mean add a "submit" page to cdnjs.com.
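The server side of such a submit page would first have to validate the submitted repo URL before handing it to the importer. Here is a minimal Python sketch of that check only, assuming submissions are restricted to `https://github.com/<owner>/<repo>` URLs. The function name `parse_github_repo_url` is hypothetical, and nothing here reflects how cdnjs-importer actually takes its input.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def parse_github_repo_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for an https GitHub repo URL, else None.

    Accepts an optional trailing slash and an optional ".git" suffix.
    Anything that is not exactly github.com/<owner>/<repo> is rejected.
    """
    parsed = urlparse(url.strip())
    if parsed.scheme != "https" or parsed.netloc.lower() != "github.com":
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2:
        return None
    owner, repo = parts
    if repo.endswith(".git"):
        repo = repo[:-4]
    if not repo:
        return None
    return owner, repo
```

A form handler could reject the submission when this returns `None` and otherwise queue the `(owner, repo)` pair for import.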

Yes, will do, but no schedule for it yet.

@arasmussen That's a neat idea! :sparkles:
@PeterDaveHello That would be very easy, using cdnjs-importer as a library.

I want to keep many libraries up to date, but it is hard to commit. The Git repo is far too large.
When I open it in SourceTree, the app crashes.

Try sparse-checkout + shallow clone:
sparseCheckout.md
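For readers without access to that file, the general shape of a sparse checkout combined with a shallow pull can be sketched as follows. This is not the content of sparseCheckout.md, just the generic git technique expressed as a Python helper that returns the commands without running them. The `master` branch, the `cdnjs` working directory, the `ajax/libs/<lib>/` path and the helper name are assumptions.

```python
from typing import List, NamedTuple


class SparsePlan(NamedTuple):
    setup: List[List[str]]  # commands to run first
    pattern_file: str       # file to create after the setup commands
    patterns: str           # content to write into pattern_file
    fetch: List[str]        # command to run last


def sparse_shallow_plan(
    repo_url: str,
    library: str,
    workdir: str = "cdnjs",
    depth: int = 1,
    branch: str = "master",
) -> SparsePlan:
    """Plan a checkout that fetches little history and materialises one library."""
    git = ["git", "-C", workdir]
    setup = [
        ["git", "init", workdir],
        git + ["remote", "add", "origin", repo_url],
        # Tell git to honour .git/info/sparse-checkout when populating files.
        git + ["config", "core.sparseCheckout", "true"],
    ]
    return SparsePlan(
        setup=setup,
        pattern_file=f"{workdir}/.git/info/sparse-checkout",
        # Leading slash anchors the pattern at the repository root.
        patterns=f"/ajax/libs/{library}/\n",
        fetch=git + ["pull", f"--depth={depth}", "origin", branch],
    )
```

Run the `setup` commands, write `patterns` to `pattern_file`, then run `fetch`. Note that sparse checkout limits which files appear in the working tree; the shallow depth is what limits how much history is downloaded.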

+1. I came here to submit a PR to add Imager.js but I don't have the time or bandwidth to wait for the entire git repo to clone. A simple website submission would be ideal.

@fiznool sorry about that, it's on our to-do list. In the meantime, you can just open a request issue and we will add it. Thanks.

@PeterDaveHello What would you guys say to splitting out the package definitions from the actual contents?

This repo (or a new one) could be the destination for package.json files, and a separate repository could be where the actual content gets stored. That way, when someone is trying to add a library (like I am), they only need to clone the config repo.

I would be willing to help with this as I want to add a library but can't because my editor crashes (and sometimes my shell) as I'm trying to work with this repository.

@clayreimann we'll need much more time to discuss how to handle the files, but in the meantime you can actually submit a PR on GitHub with only a single package.json. Please take a look at https://github.com/cdnjs/cdnjs/pull/7149

Where will that discussion happen (or is it already happening)?

@clayreimann some on GitHub, some on Gitter. Sorry, there is in fact no full-time developer here, so things may be a bit messy, but I really appreciate that you would like to help solve this problem.

@PeterDaveHello Is there a place where the architecture of cdnjs is described? i.e. where is it hosted (gh-pages?), where does cloudflare get the assets it's caching, where does PeterBot live?

@clayreimann: CloudFlare periodically pulls the cdnjs repo to their edge servers, and PeterBot lives on my own VPS.

I'd hate to see what a low priority issue looks like.

We are working to split how cdnjs works so that there is a much smaller repository for humans to work with. Keep an eye on cdnjs/packages.
