Per the discussion in https://github.com/nodejs/security-advisories/pull/13 (thanks @mhdawson!), I wanted to follow up with an issue to discuss indices for the two kinds of security vulnerabilities that are easily parsable and have a low barrier to entry for end-users who'd like to consume this data.
Apologies for not opening an issue first, as I'd indicated I would in the meeting – I let enough time pass between the meeting and working on that PR that I'd forgotten I'd said I'd do that ❤️
@mhdawson raised two points in that issue:
I'm +1 on this
I believe that index generation automation is something we already have - the tooling. The question here is more about whether we want to actual execute it and include on part of the repo (which I'm good with).
How about before actually implementing we can have a plan and general design of how the index would work? Such as, consider the following:
vulns/indexWDYT @bnb?
@lirantal that sounds good to me, and effectively reflects the implementation that exists in https://github.com/nodejs/security-advisories/pull/13 – given the possible caveat of on PR vs. on push, unless you meant to have it happen in this repository 😅
If that was a strawman suggestion then it was a very good one 👍
Once security-advisories is out it should live there indeed but that's just semantics for now.
So yes, just dumped an idea for how it would work. Let's see if others can find issues with this flow or confirm it. To be clear, I intend for the indexes to be created in the CI on master. Because we require all branches to be up to date with the tip of master this would ensure we have no conflicts.
Any more feedback from other @nodejs/security-wg members? ❤️
I think splitting the conversation made it harder to follow. I commented on the PR.
It isn't clear what feedback you want, perhaps asking specific questions will get more response?
@sam-github quoting @mhdawson here (emphasis mine):
Sorry for slowing things down but I think we need good consensus before introducing external data that people will depend on and may be hard to stop producing later on....
My understanding of the point of this issue (sorry if I didn't make that clear) is to build consensus that these files should exist.
I've a couple concernes:
1 - the files are trivially creatable by consumers, perhaps the script you wrote could even be provided to build the index files for people who want it
2 - the index files will grow without bounds, and git won't deal well with them. git does pretty well with single immutable files, as single ones get added, but one large one that changes with each commit may be a problem
3 - the index doesn't have many permutations. its pretty easy to lookup the json files by id already, since the id is part of the file name, so its not a whole lot better than file names as lookup (though it is a bit easier)
Perhaps it would be better for someone to publish a mirror + index somewhere? Perhaps even an npm publish of the data + index + js API could be published to npmjs.com on every travis merge? I personally like the low maintenance overhead of letting distribution mechanisms evolve & innovate outside of the core repo used to store the vulns.
Those concerns aside, I'm not sure how many consumers of this data there are, or how big a barrier the lack of index is, maybe we can deal with the size problem later and easing adoption of the data is more important in the short term.
@bnb do you have a personal use for this? do you have any concerns about the endless growth of the json index files?
I'd like to address these two points:
1 - the files are trivially creatable by consumers, perhaps the script you wrote could even be provided to build the index files for people who want it
Having worked on the only (?) module (node-release-lines) that consumes this data, this is an assumption of triviality that does not hold up. Having to pull a git repo and consume + build the data from that _or_ reading an unknown amount of files from GitHub raw URLs is 100% not trivial.
Consuming this data was – by far – the most difficult part of node-release-lines, and it wasn't even the main feature. This was one of the biggest blockers in an online mode, rather than just updating every time there's a new update to this data or the release data.
- the index doesn't have many permutations. its pretty easy to lookup the json files by id already, since the id is part of the file name, so its not a whole lot better than file names as lookup (though it is a bit easier)
I would agree with this based on how you define look up. For a human, sure. It is less easy when you need to do it with a computer – as mentioned, either clone the repo and consume the data that way or have an index of pointers to files.
Having pointers to files is one approach that was mentioned but it is only slightly better than the current approach. It still places the burden of instrumenting requests, JSON parsing, and object building out to the user.
By taking the approach of building an index that enhances the JSON every time something new is added, we're lifting the burden of dozens (hundreds, in the case of ecosystem) requests from end-users. Instead, we're simply providing them with all the data and allowing them to focus on consuming the data to do whatever work they want to do rather than worrying about _building_ the data, making sure the data is correct, and then consuming it.
Perhaps it would be better for someone to publish a mirror + index somewhere? Perhaps even an npm publish of the data + index + js API could be published to npmjs.com on every travis merge? I personally like the low maintenance overhead of letting distribution mechanisms evolve & innovate outside of the core repo used to store the vulns.
I genuinely agree with your thought on letting distribution mechanisms evolve and innovate outside of the core repo, but _in my opinion_ the data is not currently easy enough to consume where we can achieve that. Introducing a single, low-barrier data source that can be consumed by anyone who is interested in distribution of the data enables us to get to that point, _in my opinion_.
I believe @lirantal was planning on some kind of distribution of the data on npm as a part of the Security WG. I've offered to help work with him on that in addition to the work I'm doing on indices.
@bnb do you have a personal use for this? do you have any concerns about the endless growth of the json index files?
There are many ways I'd like to start using this data if it lands:
engines in package.json (used by many cloud providers, like Heroku and GCP).nvmrc (nvm, Travis, other CI systems)Dockerfile (used by Docker on deployments – according to Docker Hub, 10m+ downloads but I believe we're over hundreds of millions).node-version (used by nvs, avn, nodenv)Addressing endless growth of JSON files... I don't particularly have concerns about that because it's just mirroring the same amount of JSON that exists in the directories, just in a different (more approachable) format. If we're concerned about data size, then perhaps using JSON to store things at all is a mistake and we should change how we're approaching this from the level of individual pieces of data.
I'm not concerned from an end-user perspective because I would imagine that people would consume the source and store it in the shape that they're looking to consume it, rather than always using the source itself.
Additionally, I'll add I think it's a mistake to _only_ go down the path of encouraging publishing this data to npm. There is a massive ecosystem of engineers who could use this data, and forcing them to use a module published to npm is not a universally acceptable approach.
So, I'm not objecting, definitely not on size grounds, and particularly given that you are volunteering to do the work, but this is a key point, IMO:
Build APIs that consume this data and whittle it down to concise, useful data for those working on tools/platforms that use the Node.js runtime.
The indices proposed are not enough to support the use-cases, you plan to build APIs on top of the indices, and once you have _them_, the rest becomes easy (or probably just "easier").
Index or not, this is the key step, and it can be done without the indices.
Is it some amount easier if you have direct access to put scripts into Travis to pregenerate some of the backing data for those APIs? For sure, but its just one step to the APIs that are needed.
Addressing endless growth of JSON files... I don't particularly have concerns about that because it's just mirroring the same amount of JSON that exists in the directories, just in a different (more approachable) format.
That isn't actually my understanding of how git stores content. git commits are SHAs of the contents, a new commit with a new file is a new Directory object, but it just contains the SHAs of the previous existing File objects. If those files have not changed, the pre-existing file contents will only need to be present in the git history once. With all the json in one file, every commit will cause a duplicate copy of the entire json contents of the repo + a little bit of new data to be stored in a completely new File object. So, its not that the repo becomes a steady 2xlarger with the index, its that the repo size will grow at a faster rate than the raw vuln data being added.
So, maybe the text files are small enough, and text/json is compressible enough, that this won't be an issue, but it makes me nervous. I guess I could do a POC to estimate size growth over time, but even though I'm a bit concerned about the long-term viability of this storage format, I'm prepared to break the format (again) later if it becomes a problem.
The indices proposed are not enough to support the use-cases, you plan to build APIs on top of the indices, and once you have them, the rest becomes easy (or probably just "easier").
I'm not really sure what the assertion/intent is here 🤔
Index or not, this is the key step, and it can be done without the indices.
As I've said, I've worked on the only ecosystem project (AFAIK) that has done this. It took more time to build a bespoke solution to surfacing this data than it took to build tools inside the project to use it. The time investment for community members is in most cases likely not worth doing this, especially not when _we_ could take this step for them while leaving the option of rolling up a bespoke solution open.
Is it some amount easier if you have direct access to put scripts into Travis to pregenerate some of the backing data for those APIs? For sure, but its just one step to the APIs that are needed.
Making just one step is a win in my experience! We should be chiseling away at the barriers to using community-owned security data however possible. Ensuring there are multiple avenues that put the community one step forward is a win, IMO.
So, its not that the repo becomes a steady 2xlarger with the index, its that the repo size will grow at a faster rate than the raw vuln data being added.
For what it's worth, my perception of the concern around size has been focused on the size of the JSON file, not the size of the git repository – apologies for that! I'm totally unfamiliar with what kind of issues can arise with large git repos, so am totally open to feedback on that.
I'm prepared to break the format (again) later if it becomes a problem.
I am as well.
OK, give the indexing a shot, then, lets see how it goes. EDIT: not to imply I'm the one to decide this, but I don't see any objections, so it seems likely to get approved.
Most helpful comment
OK, give the indexing a shot, then, lets see how it goes. EDIT: not to imply I'm the one to decide this, but I don't see any objections, so it seems likely to get approved.