Warehouse: GitHub Token Scanning Partnership

Created on 19 Jun 2019 · 18Comments · Source: pypa/warehouse

What's the problem this feature will solve?

GitHub has a "Token Scanning" feature, which enables them to detect API keys/tokens, and notify the service provider who issued the token. The service providers can then act on this information according to their own policies (revoke, contact owner etc).

This helps avoid situations where an active API key is available publicly on the project's repository. GitHub has a Partnership program for this, with various service providers (incl. npm and more).

What's the problem this feature will solve?

It would be worthwhile to explore partnering with GitHub for Token Scanning, when the support for API Keys has been completed.

Additional context

https://developer.github.com/partnerships/token-scanning/ contains details about how to contact GitHub for this partnership, and details on how the integration works.

feature request help needed

Source

pradyunsg

❤2

All 18 comments

Thanks @pradyunsg! (blocked on #994)

brainwane on 19 Jun 2019

😄1

This reminds me of https://python-security.readthedocs.io/pypi-vuln/index-2017-11-08-pypirc_exposure_on_github.html .

ewjoachim on 20 Jul 2019

The next steps necessary here are to implement the endpoints on PyPI for the token alert service including signature verification and optionally revocation and notification.

ewdurbin on 16 Dec 2019

I might be interested, as a first contribution. I don't want to block it though, so if anyone passes by and sees this, don't consider it's already taken before I post a PR.

ewjoachim on 17 Dec 2019

Documentation is missing a few elements, especially regarding cryptographic signature (I'd like to get a working example of key cert, signature and payload to make sure we're doing things right).

I'm thinking about contacting the [email protected] email. Who should I Cc ? @ewdurbin and @pradyunsg ? (Ideally, GitHub being GitHub, they may agree to hold the discussion in this ticket.)

ewjoachim on 22 Dec 2019

@ewdurbin might know if there's already been contact with GitHub on this front.

Happy to be cc'ed in whatever contact we're having with GitHub on this, if folks reckon that's a good idea. 😁

pradyunsg on 22 Dec 2019

@ewjoachim you're welcome to CC me: [email protected]

ewdurbin on 23 Dec 2019

Ok, mail sent. Here are the questions I've asked in it.

The documentation doesn't really specify how the signature is made. The code examples mention the different technologies used, but even if I do some research on the default parameters of the functions used in the samples, I'm afraid I will not be able to verify my assertions without a working example. It would be much easier if we could have access to a sample x509 key, a payload and its signature generated and encoded the same way the real ones will be, so that we could make sure our decryption functions really work in the integration tests. If you already have a working example using Python, of course, that would be best, but once we have an example payload, I'm confident we'll be able to do the rest.
The 2 code samples have a 64-character-long key id, but the payload sample has a 42-character-long key id (that is the 42 first characters of the 64 chars key), and if I understand correctly, they're expected to match precisely. Is there something I haven't understood in the sample (I'm not very familiar with either Go or Ruby) ?
On this subject, the documentation is a bit unclear regarding how keys should be handled:
Should we keep a copy of the key id in the code, make a 3 way comparison that "key id in the code == key id in the payload == key id fetched in the meta api" ? If yes, should we also keep the key (and not just the key id) in the code, or should we trust the one in the API and check that its fingerprint corresponds to our key id ?
Or should we just compare "key in the payload == key fetched in the meta api" ?
Or should we do that but also store the key in a trusted cache of ours or in memory (persisting between request/response cycles) ? If so, how long would be a reasonable cache time ?

And in both cases, how should we handle "is_current" ? I understand it as a way to gracefully rotate keys on your side, the most recent one being the "current" one, with the idea being that when a new key is added (how often should we expect that ?), the old key will still be available (for how long ?) if we need to validate old payloads. In this case, if the key is stored in the code the "is_current" attribute tells us to warn loudly because it's about to break.
If the key is not stored in the code but either cached or not stored at all, can you confirm that the "is_current" attribute will not be of any use to us?

I understand that how often the endpoint will be called on our side depends on how often people will commit strings matching the regex we'll register, but the payload seems to support batch calls, and I'm interested in the batch factors. Is it one call per file ? Per commit ? Per push ? Per time amount (minute ? Hour ? Day ?)

Github folks, if you read this and it's ok with you, please feel free to answer here rather than by email, this way the discussion stays public and the details we share may be useful to the whole community :)

ewjoachim on 26 Dec 2019

👍1

Here are the answers from GitHub. Everything following is a quote.

The documentation doesn't really specify how the signature is made.

The signature is a ECDSA-NIST-P256V1-SHA256 signature. I’m not super familiar with python’s elliptic curve libraries. But, googling, turns up https://pypi.org/project/ecdsa/ as a seemingly popular one. I _think_ you would do something like:

vk = VerifyingKey.from_pem(key_from_meta_api)
vk.verify(signature_from_webhook, message_signed_from_webhook)

I agree that a working example in the docs would be useful. Very early on this was a working example (i.e. before this API was public). It should be updated to work with the current keys used in production. For now, I pulled sample payload/signature/validation I had lying around. I’m on mobile right now, so haven’t double checked it, but this should be useful. It is in Ruby, but hopefully it provides reasonable guidance/let’s you double-check your python implementation:

See implementation Gist: https://gist.github.com/ewjoachim/7dde11c31d9686ed6b4431c3ca166da2

The 2 code samples have a 64-character-long key id, but the payload sample has a 42-character-long key id (that is the 42 first characters of the 64 chars key), and if

That looks like a typo..something go truncated somewhere.

On this subject, the documentation is a bit unclear regarding how keys should be handled

There isn’t a singular “correct” answer to this, but will go through your questions below:

Should we keep a copy of the key id in the code, make a 3 way comparison that "key id in the code == key id in the payload == key id fetched in the meta api" ? If yes, should we also keep the key (and not just the key id) in the code, or should we trust the one in the API and check that its fingerprint corresponds to our key id ?

It is expected you would trust the one from the API. . As such, you aren’t expected to keep a key in code.

Or should we just compare "key in the payload == key fetched in the meta api" ?

Yes, you would use the key ID from the payload to look for the associated public key from the meta API.

Or should we do that but also store the key in a trusted cache of ours or in memory (persisting between request/response cycles) ? If so, how long would be a reasonable cache time ?

yes, a cache would be advisable to prevent having to fetch the key to validate every webhook. The cache lifetime is ultimately a trade off on performance and security, but there isn’t a singular correct answer. We will return the set of public keys that are valid and can/should be used for signature validation. If a key is removed from this endpoint it shouldn’t be used for validation going forward. For token scanning, the performance vs. security tradeoff isn’t (likely) that consequential. We added signatures to the webhook payloads to allow folks like yourselves to be able to differentiate hooks sent from GitHub from random noise on the Internet. But, generally speaking, there isn’t much risk associated with these payloads. From this perspective, for most folks, it is fine to give cache longer than not (tens of minutes, an hour, etc).

And in both cases, how should we handle "is_current" ? I understand it as a way to gracefully rotate keys on your side, the most recent one being the "current" one, with the idea being that when a new key is added (how often should we expect that ?), the old key will still be available (for how long ?) if we need to validate old payloads.

As implemented at the moment, this endpoint will return up to 10 keys (current plus nine prior). However, that is not an implementation detail to rely on, as we will likely change this as we look to eventually rotate keys as a matter of operational practice. Generally speaking we only really intend for webhooks to be validated when they are sent. As such, the meta API endpoint will likely eventually return the current key and one prior key (to cleanly handle rotation).

In this case, if the key is stored in the code the "is_current" attribute tells us to warn loudly because it's about to break.

Yes, this is definitely a reason for not storing the key in code

If the key is not stored in the code but either cached or not stored at all, can you confirm that the "is_current" attribute will not be of any use to us?

Generally I would say this is the case. This “is_current” attribute was added relatively early on and should probably be revisited going forward.

I understand that how often the endpoint will be called on our side depends on how often people will commit strings matching the regex we'll register, but the payload seems to support batch calls, and I'm interested in the batch factors. Is it one call per file ? Per commit ? Per push ? Per time amount (minute ? Hour ? Day ?)

The hooks are sent once per “repository event” (not an official term). Generally speaking these events include:

A push (can include one commit from one branch, or 100 commits across 10 branches)
A web edit (i.e. a singular commit to a single branch)
When a private repository is turned public (triggers a full repository history scan)

ewjoachim on 8 Jan 2020

I can work on this issue

sanjaysiddhanti on 10 Jul 2020

I already have a working PR, but you’re welcome to review it. It’s been sitting there for 6 months and I’m not sure what it needs to go forward.

https://github.com/pypa/warehouse/pull/7124

That being said, if you want to make your own version, you can, but... it might be more interesting to put hours into work that hasn’t already been done :)

ewjoachim on 11 Jul 2020

Thanks @ewjoachim, I had missed that entirely. I'll give it a review and see if we can move it forward.

di on 13 Jul 2020

🎉1

FYI @ewjoachim, #7124 was reverted in #8555 due to #8554.

di on 15 Sep 2020

Ok. Is there any material to help understand where the bug was ?

ewjoachim on 15 Sep 2020

Got it.

https://github.com/pypa/warehouse/pull/8562 < fixes the code that caused the revert
https://github.com/pypa/warehouse/pull/8563 < reintroduces the rest of the code (is there anything I should be doing to avoid another revert ? I thought I had done everything right and it seems I didn't)
https://github.com/pypa/warehouse/pull/8561 < related small problem but orthogonal to the changes

ewjoachim on 15 Sep 2020

❤1

Sorry, was sleeping 🙂 That looks right, once the comments on #8562 are resolved I think both can be re-merged.

di on 15 Sep 2020

❤1

Don’t be sorry, still being able to sleep in 2020 is an achievement of its own ;)

ewjoachim on 16 Sep 2020

😄1

@ewjoachim

is there anything I should be doing to avoid another revert ?

The generic answer would be to use feature-flags + distill atomic parts of changes into separate PRs (as many as possible) so that they'd be accepted gradually (it also makes the main PR smaller and easier to spot problems...)