What's the problem this feature will solve?
GitHub has a "Token Scanning" feature, which enables them to detect API keys/tokens, and notify the service provider who issued the token. The service providers can then act on this information according to their own policies (revoke, contact owner etc).
This helps avoid situations where an active API key is available publicly on the project's repository. GitHub has a Partnership program for this, with various service providers (incl. npm and more).
What's the problem this feature will solve?
It would be worthwhile to explore partnering with GitHub for Token Scanning, when the support for API Keys has been completed.
Additional context
https://developer.github.com/partnerships/token-scanning/ contains details about how to contact GitHub for this partnership, and details on how the integration works.
Thanks @pradyunsg! (blocked on #994)
The next steps necessary here are to implement the endpoints on PyPI for the token alert service including signature verification and optionally revocation and notification.
I might be interested, as a first contribution. I don't want to block it though, so if anyone passes by and sees this, don't consider it's already taken before I post a PR.
Documentation is missing a few elements, especially regarding cryptographic signature (I'd like to get a working example of key cert, signature and payload to make sure we're doing things right).
I'm thinking about contacting the [email protected] email. Who should I Cc ? @ewdurbin and @pradyunsg ? (Ideally, GitHub being GitHub, they may agree to hold the discussion in this ticket.)
@ewdurbin might know if there's already been contact with GitHub on this front.
Happy to be cc'ed in whatever contact we're having with GitHub on this, if folks reckon that's a good idea. đ
@ewjoachim you're welcome to CC me: [email protected]
Ok, mail sent. Here are the questions I've asked in it.
The documentation doesn't really specify how the signature is made. The code examples mention the different technologies used, but even if I do some research on the default parameters of the functions used in the samples, I'm afraid I will not be able to verify my assertions without a working example. It would be much easier if we could have access to a sample x509 key, a payload and its signature generated and encoded the same way the real ones will be, so that we could make sure our decryption functions really work in the integration tests. If you already have a working example using Python, of course, that would be best, but once we have an example payload, I'm confident we'll be able to do the rest.
The 2 code samples have a 64-character-long key id, but the payload sample has a 42-character-long key id (that is the 42 first characters of the 64 chars key), and if I understand correctly, they're expected to match precisely. Is there something I haven't understood in the sample (I'm not very familiar with either Go or Ruby) ?
On this subject, the documentation is a bit unclear regarding how keys should be handled:
And in both cases, how should we handle "is_current" ? I understand it as a way to gracefully rotate keys on your side, the most recent one being the "current" one, with the idea being that when a new key is added (how often should we expect that ?), the old key will still be available (for how long ?) if we need to validate old payloads. In this case, if the key is stored in the code the "is_current" attribute tells us to warn loudly because it's about to break.
If the key is not stored in the code but either cached or not stored at all, can you confirm that the "is_current" attribute will not be of any use to us?
Github folks, if you read this and it's ok with you, please feel free to answer here rather than by email, this way the discussion stays public and the details we share may be useful to the whole community :)
Here are the answers from GitHub. Everything following is a quote.
The documentation doesn't really specify how the signature is made.
The signature is a ECDSA-NIST-P256V1-SHA256 signature. Iâm not super familiar with pythonâs elliptic curve libraries. But, googling, turns up https://pypi.org/project/ecdsa/ as a seemingly popular one. I _think_ you would do something like:
vk = VerifyingKey.from_pem(key_from_meta_api)
vk.verify(signature_from_webhook, message_signed_from_webhook)
I agree that a working example in the docs would be useful. Very early on this was a working example (i.e. before this API was public). It should be updated to work with the current keys used in production. For now, I pulled sample payload/signature/validation I had lying around. Iâm on mobile right now, so havenât double checked it, but this should be useful. It is in Ruby, but hopefully it provides reasonable guidance/letâs you double-check your python implementation:
See implementation Gist: https://gist.github.com/ewjoachim/7dde11c31d9686ed6b4431c3ca166da2
The 2 code samples have a 64-character-long key id, but the payload sample has a 42-character-long key id (that is the 42 first characters of the 64 chars key), and if
That looks like a typo..something go truncated somewhere.
On this subject, the documentation is a bit unclear regarding how keys should be handled
There isnât a singular âcorrectâ answer to this, but will go through your questions below:
- Should we keep a copy of the key id in the code, make a 3 way comparison that "key id in the code == key id in the payload == key id fetched in the meta api" ? If yes, should we also keep the key (and not just the key id) in the code, or should we trust the one in the API and check that its fingerprint corresponds to our key id ?
It is expected you would trust the one from the API. . As such, you arenât expected to keep a key in code.
- Or should we just compare "key in the payload == key fetched in the meta api" ?
Yes, you would use the key ID from the payload to look for the associated public key from the meta API.
- Or should we do that but also store the key in a trusted cache of ours or in memory (persisting between request/response cycles) ? If so, how long would be a reasonable cache time ?
yes, a cache would be advisable to prevent having to fetch the key to validate every webhook. The cache lifetime is ultimately a trade off on performance and security, but there isnât a singular correct answer. We will return the set of public keys that are valid and can/should be used for signature validation. If a key is removed from this endpoint it shouldnât be used for validation going forward. For token scanning, the performance vs. security tradeoff isnât (likely) that consequential. We added signatures to the webhook payloads to allow folks like yourselves to be able to differentiate hooks sent from GitHub from random noise on the Internet. But, generally speaking, there isnât much risk associated with these payloads. From this perspective, for most folks, it is fine to give cache longer than not (tens of minutes, an hour, etc).
And in both cases, how should we handle "is_current" ? I understand it as a way to gracefully rotate keys on your side, the most recent one being the "current" one, with the idea being that when a new key is added (how often should we expect that ?), the old key will still be available (for how long ?) if we need to validate old payloads.
As implemented at the moment, this endpoint will return up to 10 keys (current plus nine prior). However, that is not an implementation detail to rely on, as we will likely change this as we look to eventually rotate keys as a matter of operational practice. Generally speaking we only really intend for webhooks to be validated when they are sent. As such, the meta API endpoint will likely eventually return the current key and one prior key (to cleanly handle rotation).
In this case, if the key is stored in the code the "is_current" attribute tells us to warn loudly because it's about to break.
Yes, this is definitely a reason for not storing the key in code
If the key is not stored in the code but either cached or not stored at all, can you confirm that the "is_current" attribute will not be of any use to us?
Generally I would say this is the case. This âis_currentâ attribute was added relatively early on and should probably be revisited going forward.
- I understand that how often the endpoint will be called on our side depends on how often people will commit strings matching the regex we'll register, but the payload seems to support batch calls, and I'm interested in the batch factors. Is it one call per file ? Per commit ? Per push ? Per time amount (minute ? Hour ? Day ?)
The hooks are sent once per ârepository eventâ (not an official term). Generally speaking these events include:
I can work on this issue
I already have a working PR, but youâre welcome to review it. Itâs been sitting there for 6 months and Iâm not sure what it needs to go forward.
https://github.com/pypa/warehouse/pull/7124
That being said, if you want to make your own version, you can, but... it might be more interesting to put hours into work that hasnât already been done :)
Thanks @ewjoachim, I had missed that entirely. I'll give it a review and see if we can move it forward.
FYI @ewjoachim, #7124 was reverted in #8555 due to #8554.
Ok. Is there any material to help understand where the bug was ?
Got it.
Sorry, was sleeping đ That looks right, once the comments on #8562 are resolved I think both can be re-merged.
Donât be sorry, still being able to sleep in 2020 is an achievement of its own ;)
@ewjoachim
- is there anything I should be doing to avoid another revert ?
The generic answer would be to use feature-flags + distill atomic parts of changes into separate PRs (as many as possible) so that they'd be accepted gradually (it also makes the main PR smaller and easier to spot problems...)