CDNJS requires a certain amount of manual review and moderation. This review is required both to allow new projects to be added to cdnjs and to handle certain types of project changes which can't be automatically imported.
This issue is a place to discuss how we might eliminate this requirement in the future.
I believe the goal of this project should be to take everything which requires a manually-reviewed PR now, and make it happen automatically. I have categorized a snapshot of the currently open Pull Requests here: https://docs.google.com/spreadsheets/d/18-HyNKxfXvzCLr6v57UrGHtchvDhNcrI8JvGn4gUwck/edit#gid=0
As you can see, the majority of these issues can be divided into:
With that, we would love to hear the community's ideas for how to eliminate the labor and danger of this manual review!
I'm going to write down a few of the proposals which came through the Slack channel to start the conversation.
At present, the cdnjs/cdnjs repo is generated by a combination of humans and bots. The humans are required to validate packages which then can often be updated automatically by a bot. One proposal which came out of the Slack channel is moving to a model where there are two repos:
The config repo would contain all of the package.json-type files which are necessary to import projects into cdnjs. It essentially maps project names to where they live on GitHub or npm, plus any special cdnjs configuration. That repo would be the only one humans ever touch.
The other repo (this one) would be entirely maintained by bots which keep its contents in sync with the origins (npm or GitHub) specified in the config. Projects which are not on one of those platforms will not be includable in CDNJS.
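To make this concrete, a per-project file in the config repo might look something like the sketch below, assuming an npm source and a whitelist of files to publish. The exact field names here are illustrative rather than a settled schema:

```json
{
  "name": "jquery",
  "description": "JavaScript library for DOM operations",
  "autoupdate": {
    "source": "npm",
    "target": "jquery",
    "fileMap": [
      {
        "basePath": "dist",
        "files": ["jquery.js", "jquery.min.js", "jquery.min.map"]
      }
    ]
  }
}
```

The bot-maintained repo would then read these files on a schedule, pull the matching releases from the named source, and commit the mapped files under the usual ajax/libs/<name>/<version>/ layout.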
This change will accomplish a few things:
The disadvantage of the previous proposal is it still requires each project to be manually added to CDNJS. We can likely automate that addition for projects which are added with the same name they use in npm, but it is nevertheless a step which project-owners have to perform before their projects can be used by the community. An alternative would be if any project could be used through cdnjs. For example, if I visit:
cdnjs.cloudflare.com/jquery/3.4.1/dist/jquery.js
I could get that file from the relevant npm project without jquery ever having been included in CDNJS. This completely eliminates the maintenance overhead, but it also means there is no single place where all of CDNJS lives. It would mean you can't trust that CDNJS will be up even when npm is down (although caching will help), and it means you can't look back through time to see what was served when, if you don't trust npm.
This somewhat combines the first two proposals. Projects would be included based on their existence on github, but could specify unique cdnjs config as a source file. We would import the project and load the package.json (for example) which would specify any specific CDNJS config (like unique file mappings). We could either do this in advance (requiring them to 'import' the project), or we could do it on the fly and cache the results.
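As a rough sketch of that, a project could carry its cdnjs-specific config inside its own package.json under a hypothetical "cdnjs" field; the field name and shape below are assumptions, not an agreed format:

```json
{
  "name": "some-library",
  "version": "2.1.0",
  "main": "dist/some-library.js",
  "cdnjs": {
    "basePath": "dist",
    "files": [
      "some-library.js",
      "some-library.min.js",
      "some-library.min.js.map"
    ]
  }
}
```

cdnjs could read that field at import time, or fetch and cache it on the fly when the first request for the project arrives.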
I like the first proposal. I think NPM direct is out since not all projects use NPM (plus it would mean CloudFlare would have to have potentially everything on NPM on their edge servers)
+1 to the first proposal - this is something I've "considered" for some time. It allows for humans and contributors to work with "cdnjs" far more easily, whilst still retaining the current "model" we have where all the files that are hosted live in a repo.
I like the first proposal. I think NPM direct is out since not all projects use NPM (plus it would mean CloudFlare would have to have potentially everything on NPM on their edge servers)
We only cache, not store, at the edge, so that isn't a concern. (npm is actually built on Cloudflare Workers, we already cache many of those files ;) ).
To allow for projects which don't use npm we could either... ask them to use npm, or support GitHub as an alternative source.
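For example, the same kind of per-project config sketched earlier could simply point at a git repository instead of npm (field names again illustrative, not a settled schema):

```json
{
  "name": "some-library",
  "autoupdate": {
    "source": "git",
    "target": "https://github.com/example/some-library.git",
    "fileMap": [
      { "basePath": "dist", "files": ["**/*.js", "**/*.css"] }
    ]
  }
}
```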
I’m still of the opinion that the first idea, where we split into two repos but maintain the existing concept, is the best first step.
That’ll alleviate the size issue for maintaining the project, as cdnjs/cdnjs will become robot-only.
Could I suggest the new repo that will contain all the package.json files is called cdnjs/packages?
There are a few blockers from this “split” happening immediately, in my mind:
From there we can then look at how we can further reduce the human workload, for example, fully automating the process for requesting/adding a new library etc.
npm-direct is probably a bad idea because there is already https://unpkg.com/ for this (which, by the way, is also hosted by CloudFlare). You need to ask yourselves if you want to continue cdnjs forever or gradually deprecate it, for example in favor of https://unpkg.com/
In any case, I can say that in order to avoid manual moderation:
Could I suggest the new repo that will contain all the package.json files is called cdnjs/packages?
Is there anyone who can and wants to do it?
Assuming we go with this strategy, I think the timeline for getting there would be something like:
- Create cdnjs-test org
- Duplicate cdnjs/cdnjs to cdnjs-test/cdnjs
- … cdnjs-test/cdnjs
- Duplicate cdnjs-test/cdnjs to cdnjs-test/packages
- … cdnjs-test/cdnjs repo
- Duplicate cdnjs-test/packages to cdnjs-test/cdnjs (see packages-migration doc)
- Create a script to update cdnjs-test/packages to only contain package.json files
- By renaming cdnjs/cdnjs to cdnjs/packages so that issues etc. can be preserved into the new "human" repo, we also bring over the repo size and history. To resolve this, the update script should use git commands to remove all non-package.json files from the entire commit history, whilst preserving the commit history with any changes to the package.json files (e.g. version being bumped)
- In terms of cleaning cdnjs/packages to retain the commit history whilst removing all the non-package.json files so it becomes a sensible size, we can use bfg with a glob of ajax/libs/*/*/**/*.
- Update cdnjs-test/packages to reflect that being the "human" repo where only package.json files are kept
- Update cdnjs-test/cdnjs to reflect it is now a bot-only repo that contains all the CDN assets
- Configure cdnjs-test/cdnjs to automatically close all PRs (wording that explains bot-only repo and sends them to the packages repo)
- Create a bot to read the package.json files contained in the cdnjs-test/packages repo and update the assets + package.json files in cdnjs-test/cdnjs
- Rename cdnjs/cdnjs to cdnjs/packages (preserving PRs/issues in the "human" repo)
- Create a new cdnjs/cdnjs repo
- Duplicate cdnjs/packages to cdnjs/cdnjs (see packages-migration doc)
- Clean cdnjs/packages to make it package.json files only (see packages-migration doc)
- … cdnjs/cdnjs & cdnjs/cdnjs from cdnjs-test
- … cdnjs-test/cdnjs to cdnjs/cdnjs
- … cdnjs org
Create cdnjs-test org
I don't think it's necessary.. This org is fine.
Duplicate cdnjs/cdnjs to cdnjs-test/packages
Create a script to update cdnjs-test/packages to only contain package.json files
I would suggest creating the configuration repository from scratch to avoid the unnecessary 100+ GB of git history.
Rename cdnjs/cdnjs to cdnjs/packages (preserving PRs/issues in the "human" repo)
Please don't. Who knows what code depends on this repo being cdnjs/cdnjs and does not support redirects.
I think that:
Having a fresh repo for cdnjs/packages means that we lose the entire history of cdnjs, which is a super important part of how we operate and ensures auditability. It also means that we leave behind all the existing PRs and issues in a repo that will become robot-only.
I don't think it's necessary.. This org is fine.
Testing potentially destructive scripts in the production org sounds like a super bad idea.
Who knows what code depends on this repo being cdnjs/cdnjs and does not support redirects.
cdnjs/cdnjs isn't going away; it will still exist, it will just be a duplicate of the existing one, so that issues and PRs are moved over to cdnjs/packages.
In terms of cleaning cdnjs/packages to retain the commit history whilst removing all the non-package.json files so it becomes a sensible size, we can use bfg with a glob of ajax/libs/*/*/**/*.
Testing potentially destructive scripts in the production org sounds like a super bad idea.
I mean that you could create a cdnjs/cdnjs-test repo instead of a separate org
Seems like a minor thing, either way, separate org for testing this seems safer to me.
Okay, so I've written the majority of the migration scripts/docs: https://github.com/cdnjs/packages-migration. I'm currently running run-bfg.sh against the cdnjs repo to test how it will go. It's currently on library 100 of the total 3452 libs in my copy of cdnjs, and it has taken 5 hours to get to this point. Based on that, it's going to take 7 days of continual running to properly clean all non-package.json files from the commit history of cdnjs.
I would still prefer to keep the commit history, but this current script doesn't make that realistic. Does anyone have ideas on how we can more quickly clean all non-package.json files from the entire commit history?
If not, I can change the steps so that we still rename the repo from cdnjs to packages, so that we preserve issues and PRs, but force push the repo to be a single fresh commit with only package.json files in it.
I think if someone wants to access the history, it can still be accessed on this repository. Also, we probably don't need every package.json there, but only the top-level ones that have auto-update info?
Yeah, history will be preserved in the new cdnjs/cdnjs repo - I'd prefer to also keep it in the packages repo but I feel that might be becoming impossible.
Also, we probably don't need every package.json there, but only the top-level ones that have auto-update info?
Yeah, whenever I'm referring to package.json-only etc. I mean ajax/libs/<lib>/package.json files, nothing else.
(I'm also running git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch ajax/libs/*/*/**/*" HEAD on another clone of cdnjs, just to see if the native git way to clean files is going to be any faster than bfg)
FWIW the whole reason I created unpkg in the first place was because cdnjs was really difficult to work with. The manual approval process was tedious, yes, but also just the sheer size of the repo meant it took a long time to clone it and checkout new branches. I actually had most of the code written for unpkg one afternoon while I was waiting for cdnjs to do a git checkout.
One really interesting thing that cdnjs has that unpkg does not have is this: your own namespace. In hindsight, I wish unpkg had its own namespace instead of being coupled so tightly to npm. If we did, we might be able to become something more than a simple reverse proxy. We might actually be able to build our own package registry... food for thought :)
I'm going to move the conversation around the package repo migration to #13652 and hide the related comments here so that this issue can be better used for discussion around the actual removal of manual moderation (CI etc.)
I think we can close this now?
Yeah, I think we're in a good place now with limited manual moderation needed.