Cdnjs: Remove the need for manual moderation

Created on 13 Nov 2019 · 20 Comments · Source: cdnjs/cdnjs

CDNJS requires a certain amount of manual review and moderation. This review is required both to allow new projects to be added to the project, and to handle certain types of project changes which can't be automatically imported.

This issue is a place to discuss how we might eliminate this requirement in the future.

I believe the goal of this project should be to take everything which requires a manually-reviewed PR now, and make it happen automatically. I have categorized a snapshot of the currently open Pull Requests here: https://docs.google.com/spreadsheets/d/18-HyNKxfXvzCLr6v57UrGHtchvDhNcrI8JvGn4gUwck/edit#gid=0

As you can see, the majority of these issues can be divided into:

  • The CI errored, the user then fixed the error, but it now requires a human to review
  • A human has to look at the code to confirm the project:

    • Is popular enough

    • Includes the right code (can't really be verified)

    • Has the right glob pattern of files to allow for auto-updating in the future

  • A handful of anomalous conditions which are likely less important to automate

With that, we would love to hear the community's ideas for how to eliminate the labor and danger of this manual review!



High Priority

All 20 comments

I'm going to write down a few of the proposals which came through the Slack channel to start the conversation.

Two Repos

At present, the cdnjs/cdnjs repo is generated by a combination of humans and bots. The humans are required to validate packages which then can often be updated automatically by a bot. One proposal which came out of the Slack channel is moving to a model where there are two repos:

  • cdnjs/config
  • cdnjs/cdnjs

The config repo would contain all of the package.json-type files which are necessary to import projects into cdnjs. It essentially represents pointers from project names to where they live on GitHub or npm, plus any special cdnjs configuration. That repo would be the only one humans ever touch.

The other repo (this one) would be entirely maintained by bots which keep its contents in sync with the origins (npm or GitHub) specified in the config. Projects which are not on one of those platforms will not be includable in CDNJS.

This change will accomplish a few things:

  1. It will mean humans rarely need to download and touch this massive repo (195 GB and counting). Given that humans don't have to touch it, it may be viable to allow projects to include all of their files in CDNJS, no longer requiring the error-prone inclusion/exclusion of files.
  2. It will mean this repo is always a valid rendition of the actual files in the respective projects. Removing human involvement will also make it much harder for the contents of this repo to be tampered with.
  3. It will retain the existence of a CDNJS repo which contains an auditable trail of exactly what is being served on cdnjs.cloudflare.com, and which is not dependent on git or npm to serve files (just to update them).
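
A minimal sketch of what the bot's sync step could look like under this proposal. The config shape (a package name mapped to a `source` and a `target`) and the `fetch_versions` callable are assumptions made for illustration; neither is an existing cdnjs format or API.

```python
from typing import Callable, Dict, List, Set

# Hypothetical shape of the human-edited config: name -> {"source", "target"}.
Config = Dict[str, Dict[str, str]]


def versions_to_import(
    config: Config,
    hosted: Dict[str, Set[str]],
    fetch_versions: Callable[[str, str], List[str]],
) -> Dict[str, List[str]]:
    """Return, per package, the upstream versions not yet in the bot-only repo."""
    missing: Dict[str, List[str]] = {}
    for name, entry in config.items():
        # fetch_versions stands in for an npm or GitHub lookup (hypothetical).
        upstream = fetch_versions(entry["source"], entry["target"])
        already = hosted.get(name, set())
        new = [v for v in upstream if v not in already]
        if new:
            missing[name] = new
    return missing


# Example with a stubbed upstream lookup, so no network access is needed.
config = {"jquery": {"source": "npm", "target": "jquery"}}
hosted = {"jquery": {"3.4.0"}}
result = versions_to_import(config, hosted, lambda source, target: ["3.4.0", "3.4.1"])
```

Here humans would only ever edit `config`; everything the bot writes is derived from it and from the upstream source.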

npm-direct

The disadvantage of the previous proposal is it still requires each project to be manually added to CDNJS. We can likely automate that addition for projects which are added with the same name they use in npm, but it is nevertheless a step which project-owners have to perform before their projects can be used by the community. An alternative would be if any project could be used through cdnjs. For example, if I visit:

cdnjs.cloudflare.com/jquery/3.4.1/dist/jquery.js

I could get that file from the relevant npm project without jquery ever having been included in CDNJS. This completely eliminates the maintenance overhead, but it also means there is no single place where all of CDNJS lives. It would mean you can't trust CDNJS will be up even when npm is down (although caching will help), and it means you can't look back through time to see what was served when if you don't trust npm.
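
As a rough illustration of the lookup this would involve, the sketch below splits a request path into package, version and file, then builds a tarball URL. The function names are hypothetical, and the URL layout is an assumption that holds for unscoped packages on the public npm registry.

```python
from typing import NamedTuple


class Request(NamedTuple):
    package: str
    version: str
    file_path: str


def parse_cdn_path(path: str) -> Request:
    """Split '/<package>/<version>/<file...>' into its three parts."""
    parts = path.strip("/").split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected /<package>/<version>/<file>, got {path!r}")
    return Request(*parts)


def npm_tarball_url(req: Request) -> str:
    # Assumption: layout used for unscoped packages on the public npm registry.
    return f"https://registry.npmjs.org/{req.package}/-/{req.package}-{req.version}.tgz"


req = parse_cdn_path("/jquery/3.4.1/dist/jquery.js")
url = npm_tarball_url(req)
```

The requested file would then be extracted from that tarball and cached, which is where the dependency on npm being up comes from.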

Config in source projects

This somewhat combines the first two proposals. Projects would be included based on their existence on GitHub, but could specify unique cdnjs config as a source file. We would import the project and load the package.json (for example), which would specify any CDNJS-specific config (like unique file mappings). We could either do this in advance (requiring them to 'import' the project), or we could do it on the fly and cache the results.
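
One way this could look, as a sketch only: the project's own package.json carries a top-level `cdnjs` object, and anything it omits falls back to defaults. The `cdnjs` key, the `basePath` and `files` fields, and the default values are all hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical defaults used when a project specifies nothing.
DEFAULT_CONFIG: Dict[str, Any] = {"basePath": "", "files": ["**/*"]}


def cdnjs_config(package_json_text: str) -> Dict[str, Any]:
    """Return the project's cdnjs settings, falling back to the defaults."""
    manifest = json.loads(package_json_text)
    # Hypothetical: a top-level "cdnjs" object in the project's package.json.
    overrides = manifest.get("cdnjs", {})
    return {**DEFAULT_CONFIG, **overrides}


example = '{"name": "example-lib", "version": "1.0.0", "cdnjs": {"basePath": "dist"}}'
config = cdnjs_config(example)
```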

I like the first proposal. I think npm-direct is out since not all projects use npm (plus it would mean Cloudflare would potentially have to hold everything on npm on their edge servers).

+1 to the first proposal - this is something I've "considered" for some time. It allows for humans and contributors to work with "cdnjs" far more easily, whilst still retaining the current "model" we have where all the files that are hosted live in a repo.

I like the first proposal. I think npm-direct is out since not all projects use npm (plus it would mean Cloudflare would potentially have to hold everything on npm on their edge servers).

We only cache, not store, at the edge, so that isn't a concern. (npm is actually built on Cloudflare Workers, we already cache many of those files ;) ).

To allow for projects which don't use npm we could either... ask them to use npm, or support GitHub as an alternative source.

I’m still of the opinion that the first idea, where we split into two repos but maintain the existing concept, is the best first step.

That’ll alleviate the size issue for maintaining the project as cdnjs/cdnjs will become robot-only.

Could I suggest the new repo that will contain all the package.json files is called cdnjs/packages?

There are a few blockers to this “split” happening immediately, in my mind:

  • Packages that have no auto-update don’t fit into this model.

    • Suggestion: still treat them the same during the “split” but don’t plan to update them ever again. They become legacy.

  • Auto-update currently needs a human to upload a new version if the file structure of a lib changes.

    • Suggestion: Auto-update should just allow changing file structures.

    • Alternative: The bot simply requests a yes/no from maintainers to add the new version, but handles the adding itself.

  • Adding a new lib (or updating the config on an existing) with auto-update currently requires that you also include the files for the new version for the CI to pass.

    • Suggestion: CI should fetch the files based on the auto-update config and use those to validate the config, don’t require the user to include them in the PR.
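
The last suggestion could be sketched roughly as follows: once CI has fetched the upstream file list, it reports any configured pattern that matches nothing. The function name and the sample data are illustrative, and Python's `fnmatch` is used as a stand-in for whatever glob implementation the real tooling uses.

```python
from fnmatch import fnmatchcase
from typing import List


def unmatched_patterns(patterns: List[str], upstream_files: List[str]) -> List[str]:
    """Return the patterns that match none of the files fetched from upstream."""
    # Note: fnmatch's "*" also matches "/", unlike a typical shell glob.
    return [
        pattern
        for pattern in patterns
        if not any(fnmatchcase(path, pattern) for path in upstream_files)
    ]


# File list as CI might obtain it from the auto-update source (made-up data).
upstream_files = ["dist/lib.js", "dist/lib.min.js", "README.md"]
problems = unmatched_patterns(["dist/*.js", "dist/*.css"], upstream_files)
```

A non-empty `problems` list would fail the check without the contributor ever committing the library's files to the PR.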

From there we can then look at how we can further reduce the human workload, for example, fully automating the process for requesting/adding a new library etc.

npm-direct is probably a bad idea because https://unpkg.com/ already exists for this (and is, by the way, also hosted by Cloudflare). You need to ask yourselves whether you want to continue cdnjs forever or gradually deprecate it in favor of, for example, https://unpkg.com/

In any case, here is what I think is needed in order to avoid manual moderation:

  1. You need to stop updating or adding packages that cannot be auto-updated. If you want to support updating packages that currently cannot be auto-updated by a simple glob, you could allow specifying a custom build script in the auto-update config (but this has security issues).
  2. A new configuration repo is needed, containing at least all the auto-update configuration from this repo.
  3. You need to subscribe to an existing name registry for any new packages (npm's is the obvious choice).

Could I suggest the new repo that will contain all the package.json files is called cdnjs/packages?

Is there anyone who can and wants to do it?

Assuming we go with this strategy, I think the timeline for getting there would be something like:

Development

  • [x] Create cdnjs-test org
  • [x] Duplicate cdnjs/cdnjs to cdnjs-test/cdnjs
  • [x] Create test issue on cdnjs-test/cdnjs
  • [x] Rename cdnjs-test/cdnjs to cdnjs-test/packages
  • [x] Create empty cdnjs-test/cdnjs repo
  • [x] Duplicate cdnjs-test/packages to cdnjs-test/cdnjs (see packages-migration doc)
  • [x] Create a script to update cdnjs-test/packages to only contain package.json files

    • As (in "production") we'll rename cdnjs/cdnjs to cdnjs/packages so that issues etc. can be preserved into the new "human" repo, we also bring over the repo size and history. To resolve this, the update script should use git commands to remove all non-package.json files from the entire commit history, whilst preserving the commit history with any changes to the package.json files (e.g. version being bumped)

    • In terms of cleaning cdnjs/packages to retain the commit history whilst removing all the non-package.json files so it becomes a sensible size, we can use bfg with a glob of ajax/libs/*/*/**/*.

  • [ ] Update documentation in cdnjs-test/packages to reflect that being the "human" repo where only package.json files are kept
  • [ ] Update documentation in cdnjs-test/cdnjs to reflect it is now a bot-only repo that contains all the CDN assets
  • [ ] Setup GitHub Actions in cdnjs-test/cdnjs to automatically close all PRs (wording that explains bot-only repo and sends them to the packages repo)
  • [ ] Rewrite cdnjs bot to work from package.json files contained in the cdnjs-test/packages repo and update the assets + package.json files in cdnjs-test/cdnjs

Production

  • [ ] Rename cdnjs/cdnjs to cdnjs/packages (preserving PRs/issues in the "human" repo)
  • [ ] Create empty cdnjs/cdnjs repo
  • [ ] Duplicate cdnjs/packages to cdnjs/cdnjs (see packages-migration doc)
  • [ ] Run the update script on cdnjs/packages to make it package.json files only (see packages-migration doc)
  • [ ] Update documentation in cdnjs/packages & cdnjs/cdnjs from cdnjs-test
  • [ ] Copy GitHub Actions setup from cdnjs-test/cdnjs to cdnjs/cdnjs
  • [ ] Enable new update bot on cdnjs org

Create cdnjs-test org

I don't think it's necessary. This org is fine.

Duplicate cdnjs/cdnjs to cdnjs-test/packages
Create a script to update cdnjs-test/packages to only contain package.json files

I would suggest creating the configuration repository from scratch to avoid an unnecessary 100+ GB of git history.

Rename cdnjs/cdnjs to cdnjs/packages (preserving PRs/issues in the "human" repo)

Please don't. Who knows what code depends on this repo being cdnjs/cdnjs and does not support redirects.

I think that:

  • This repo (cdnjs/cdnjs) should remain unchanged and should be managed by bots
  • Fresh repo (either cdnjs/packages or cdnjs/config) should contain auto-update configuration

Having a fresh repo for cdnjs/packages means that we lose the entire history of cdnjs, which is a super important part of how we operate and ensures auditability. It also means that we leave behind all the existing PRs and issues in a repo that will become robot-only.

I don't think it's necessary. This org is fine.

Testing potentially destructive scripts in the production org sounds like a super bad idea.

Who knows what code depends on this repo being cdnjs/cdnjs and does not support redirects.

cdnjs/cdnjs isn't going, it will still exist, it will just be a duplicate of the existing one, so that issues and PRs are moved over to cdnjs/packages.

In terms of cleaning cdnjs/packages to retain the commit history whilst removing all the non-package.json files so it becomes a sensible size, we can use bfg with a glob of ajax/libs/*/*/**/*.

Testing potentially destructive scripts in the production org sounds like a super bad idea.

I mean that you could create a cdnjs/cdnjs-test repo instead of a separate org.

Seems like a minor thing, either way, separate org for testing this seems safer to me.

Okay, so I've written the majority of the migration scripts/docs: https://github.com/cdnjs/packages-migration. I'm currently running run-bfg.sh against the cdnjs repo to test how it will go. It's currently on library 100 of the total 3452 libs in my copy of cdnjs. It has taken 5 hours to get to this point. Based on that, it's going to take 7 days continually running to properly clean all non-package.json files from the commit history of cdnjs.
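
The 7-day figure is a straight-line extrapolation from the numbers above. A quick sketch of that arithmetic, assuming every library takes about as long to clean as the first 100 did:

```python
def estimated_days(done: int, total: int, hours_so_far: float) -> float:
    """Linear extrapolation: assumes each library takes the same time to clean."""
    hours_total = hours_so_far * total / done
    return hours_total / 24


# 100 of 3452 libraries processed after 5 hours.
days = estimated_days(done=100, total=3452, hours_so_far=5)
```

That works out to roughly 173 hours in total, a little over 7 days of continuous running.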

I would still prefer to keep the commit history, but this current script doesn't make that realistic. Does anyone have ideas on how we can more quickly clean all non-package.json files from the entire commit history?

If not, I can change the steps so that we rename the repo from cdnjs to packages still so that we preserve issues and PRs, but force push the repo to be a single fresh commit with only package.json files in it.

I think if someone wants to access the history, it can still be accessed in this repository. Also, we probably don't need every package.json there, only the top-level ones that have auto-update info?

Yeah, history will be preserved in the new cdnjs/cdnjs repo - I'd prefer to also keep it in the packages repo but I feel that might be becoming impossible.

Also, we probably don't need every package.json there, only the top-level ones that have auto-update info?

Yeah, whenever I'm referring to package.json-only etc. I mean ajax/libs/<lib>/package.json files, nothing else.
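
To make that rule concrete, here is a small sketch of a predicate that is true only for ajax/libs/<lib>/package.json. The function name is hypothetical, and it illustrates the intent only; it is not how bfg or git interpret the glob.

```python
from pathlib import PurePosixPath


def is_library_manifest(path: str) -> bool:
    """True only for ajax/libs/<lib>/package.json."""
    parts = PurePosixPath(path).parts
    return (
        len(parts) == 4
        and parts[:2] == ("ajax", "libs")
        and parts[3] == "package.json"
    )


kept = is_library_manifest("ajax/libs/jquery/package.json")
dropped = is_library_manifest("ajax/libs/jquery/3.4.1/jquery.js")
```

A package.json nested inside a version directory, such as ajax/libs/jquery/3.4.1/package.json, is deliberately not matched.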

(I'm also running git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch ajax/libs/*/*/**/*" HEAD on another clone of cdnjs just to see if the native git way to clean files is going to be any faster than bfg)

FWIW the whole reason I created unpkg in the first place was because cdnjs was really difficult to work with. The manual approval process was tedious, yes, but also just the sheer size of the repo meant it took a long time to clone it and checkout new branches. I actually had most of the code written for unpkg one afternoon while I was waiting for cdnjs to do a git checkout.

One really interesting thing that cdnjs has that unpkg does not have is this: your own namespace. In hindsight, I wish unpkg had its own namespace instead of being coupled so tightly to npm. If we did, we might be able to become something more than a simple reverse proxy. We might actually be able to build our own package registry... food for thought :)

I'm going to move the conversation around the package repo migration to #13652 and hide the related comments here so that this issue can be better used for discussion around the actual removal of manual moderation (CI etc.)

I think we can close this now?

Yeah, I think we're in a good place now with limited manual moderation needed.
