Pip: Differentiating organic vs automated installations

Created on 13 Jun 2018 · 20 Comments · Source: pypa/pip

What's the problem this feature will solve?

Currently, pip installation statistics are aggregated in Google Cloud (BigQuery) and made available on libraries.io and pepy.tech. A lot of effort has gone into these numbers, but thanks to automation, they mean less now than they did a few years ago.

CI and other automation, combined with maybe a bit too much reliance on PyPI's central infrastructure, have inflated the download numbers and diluted the signal with noise.

Describe the solution you'd like

We could detect when pip is being used interactively (by checking if stdin is a tty or some other mechanism), and include that in the pip install request headers, to be included in the statistics generated by the server.
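A minimal sketch of the idea, not pip's actual implementation: it checks whether stdin is attached to a terminal and embeds the result in a JSON payload appended to a User-Agent-style string. The helper names `is_interactive` and `build_user_agent`, the `interactive` key, and the placeholder version are all hypothetical, chosen only for illustration.

```python
import json
import sys


def is_interactive(stream=None):
    """Return True if the stream (default: sys.stdin) is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    try:
        return bool(stream.isatty())
    except (AttributeError, ValueError):
        # stdin may be None, replaced by an object without isatty(), or closed.
        return False


def build_user_agent(tool="pip", version="0.0", stream=None):
    """Build a User-Agent-style string carrying a JSON payload.

    The "interactive" key is a hypothetical field name for this sketch.
    """
    payload = {
        "installer": {"name": tool, "version": version},
        "interactive": is_interactive(stream),
    }
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "{}/{} {}".format(tool, version, encoded)


if __name__ == "__main__":
    # Compare `python sketch.py` with `python sketch.py < /dev/null`.
    print(build_user_agent())
```

The server would then have to parse the payload and store the flag alongside each download record for it to show up in the statistics.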

This would provide us with much cleaner data for highlighting actual community activity, instead of drowning in automation trends, overly favoring professionalized sectors of Python. Specifically, a library being manually installed 100 times may well indicate something much more interesting than a CI (or, unfortunately, a production) fleet installing a package 10,000 times.

Additional context

  • I wasn't sure whether to file this on pip or on Warehouse; it seems like a chicken-and-egg problem to me.
  • I'm not really sure if/how other package indexes solve this, but would be very interested in hearing.
  • As an arbitrary example, I happen to know Mozilla uses PyPI for quite a few relatively-internal packages. Granted, they're open-source and I'm happy to see some infrastructure synergy. But, picking at random, mozlog actually ranks ok for downloads, even though it's not a very broadly-useful package, and I'm pretty sure the data will show it's mostly Mozilla infrastructure downloading it.

Thanks for your attention and keep up the good work!

auto-locked needs discussion enhancement

All 20 comments

Note that because of https://github.com/pypa/linehaul/issues/30, the numbers on Google BigQuery may already be meaningless for answering some questions.

I agree. It would definitely be useful to have separation of automated vs direct usage.

Maybe pypa/packaging-problems would be a good place for it?

@mhsmith like Nathaniel pointed out, the lossage should be fairly uniform, so I think the numbers would still be somewhat representative, if we were collecting them on top of the leaky linehaul, that is :)

@pradyunsg Glad you agree! Given that I suspect (and suggested) a straightforward pip enhancement, I'd like to keep this issue open. That said, I may cross-post this there, if you think it would improve the visibility. Let me know if so!

I think the fundamental problem here is that you can't actually detect this reasonably. For instance, if someone manually runs a bash script (or even a tox command), we'd probably want that to not be counted as automated -- but by default those things will not have a tty. On the flip side, you have things like Travis CI, which I believe mimics a tty, so Travis CI will look like a manual install instead of an automated one.

On a theoretical level, I don't have any problem with the idea-- I just have never been able to think of a good way of actually differentiating the types of uses automatically.

If we want to detect running under CI, I think that's actually fairly easy, because CI systems tend to advertise that fact in the environment. Just checking for "CI" in os.environ or "BUILD_ID" in os.environ or "BUILD_BUILDID" in os.environ would probably catch 95% of cases (including at least Travis-CI, AppVeyor, Circle-CI, Jenkins, VSTS).
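The check described in this comment can be sketched as a single function. The name `looks_like_ci` and the optional `environ` parameter (added so the function can be exercised without touching the real environment) are assumptions for illustration.

```python
import os


def looks_like_ci(environ=None):
    """Return True if any of the three variables named above is set.

    Only the presence of a variable is checked, not its value.
    """
    environ = os.environ if environ is None else environ
    return (
        "CI" in environ
        or "BUILD_ID" in environ
        or "BUILD_BUILDID" in environ
    )


if __name__ == "__main__":
    print("running under CI" if looks_like_ci() else "no CI marker found")
```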

Or if you want to get fancier, it looks like the ci-info package (2.5 million weekly downloads) has a fairly comprehensive list of envvars to check for: https://github.com/watson/ci-info/blob/master/index.js
(Looks like they're missing VSTS though.)

See https://github.com/The-Compiler/pytest-vw for a Python project that can detect CI.

Yea, it isn't difficult to detect whether you're running in a CI, on most CI services -- or for that matter even which one you're running on. We likely still won't know what percentage of the non-CI runs are not automated, but having a separation between CI/non-CI is a good start.

I don't know if we'd want to have any distinction between various CI services (logging NULL if we don't have the information, otherwise a string like "travis" representing the service).
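For illustration only, here is one way the "service name or NULL" idea could look. The function name `detect_ci_service` and the marker table are hypothetical; the marker variables listed (`TRAVIS`, `APPVEYOR`, `CIRCLECI`, `JENKINS_URL`) are assumed to be set by the respective services and should be verified against each service's documentation before relying on them.

```python
import os

# (service label, marker variable) pairs. The markers are assumptions
# for this sketch, not a vetted list.
_SERVICE_MARKERS = (
    ("travis", "TRAVIS"),
    ("appveyor", "APPVEYOR"),
    ("circleci", "CIRCLECI"),
    ("jenkins", "JENKINS_URL"),
)


def detect_ci_service(environ=None):
    """Return a short service label, or None when no known marker is set."""
    environ = os.environ if environ is None else environ
    for service, marker in _SERVICE_MARKERS:
        if marker in environ:
            return service
    return None


if __name__ == "__main__":
    print(detect_ci_service())
```

Returning `None` maps naturally onto a NULL value: `json.dumps` serializes it as `null`, so a consumer can distinguish "unknown" from a named service.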

I posted #6273 to start addressing this.

I'm going to leave this open for now, rather than letting it auto-close, so we can discuss whether an additional key-value pair should be added to store the value of isatty(). The PR that was just merged stores a different piece of information: whether pip is known to be running in CI.

FWIW, I pinged on #zuul on Freenode, to see if anyone there has inputs on how to detect running within Zuul. That said, better detection of that is not a blocker in any form.

I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

@theacodes if one can set an environment variable, wouldn't setting CI=true achieve just that? Or will that have an impact on other parts of the CI?

Yeah, it might have unintended consequences.

I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

I think it'd be fine (and low maintenance) to support this. The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable here: https://github.com/pypa/pip/blob/5a00ac4130108605fb1bf51e0e771041664fa33b/src/pip/_internal/download.py#L80
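A rough sketch of what that change could look like, assuming the check is driven by a tuple of variable names as the comment implies. The tuple contents besides `PIP_IS_CI` and the function name `looks_like_ci` are assumptions here, not a copy of the linked file.

```python
import os

# Names whose presence in the environment is treated as "running in CI".
# PIP_IS_CI is the explicit opt-in proposed in this thread; the other
# entries are assumed for this sketch.
CI_ENVIRONMENT_VARIABLES = (
    "BUILD_BUILDID",
    "BUILD_ID",
    "CI",
    "PIP_IS_CI",
)


def looks_like_ci(environ=None):
    """Return True if any listed variable is present, whatever its value."""
    environ = os.environ if environ is None else environ
    return any(name in environ for name in CI_ENVIRONMENT_VARIABLES)


if __name__ == "__main__":
    # A custom CI system could opt in with, for example: PIP_IS_CI=1
    print(looks_like_ci())
```

One consequence of a presence-only check is that `PIP_IS_CI=0` or `PIP_IS_CI=false` would still count as CI; a real implementation might want to decide explicitly whether the value matters.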

The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable

@theacodes If you could file a PR for this, that'd be great!

Is PIP_IS_CI recommended for non-CI automated installations?
For example, provisioning server via cloud-init or ansible.

Is PIP_IS_CI recommended for non-CI automated installations?

It seems like it should be for any automated runs, but I'm not the one using this data. Is it worth making the environment variable name more descriptive (e.g. PIP_IS_AUTOMATED)? I'm also not sure to what extent this should be publicized / recommended for others to use.

Reflecting a bit more on this, to @methane's implicit point, if we're going to expose an environment variable I'm thinking it would be better to call it something like PIP_IS_AUTOMATED. That would document the intent more clearly.

I think there are several different things we might be trying to track here.

Test vs non-test: installs for testing are "subsidiary" to "real" installs:
they don't directly solve someone's problem; their purpose is just to make
sure things are working for later when someone tries to use the code for
its primary purpose. If you want to count how many installs are intended to
use the code for its primary purpose, then you want to eliminate test
installs. But if someone installs on a big fleet of production boxes,
that's real usage.

Automated versus interactive: if you want to count how many people actually
typed "pip install mypackage", then that's a different question, and
automated installs shouldn't count.

In principle, maybe we should track both of these separately. More data
allows you to do more :-). In practice, I don't think we have any technical
mechanism to track automated vs interactive installs. Even if everyone on
this thread goes off and manually updates their deployment system to set
some magic envvar, I'm guessing the vast majority of automated installs
won't set that envvar, and that will make the data really hard to
interpret.

A field for "is this running in CI?" is also hard to interpret or connect
to what we really want to know, like how many users our project has. But
it's at least technically feasible, and it's easy to communicate what it
does and doesn't mean to people trying to interpret the data.

So I'm inclined to say, let's just keep it as a CI flag for now. And we can
always revisit once we see the data :-)

Okay, that's fine with me. And that would mean then that the answer to @methane's original question ("Is PIP_IS_CI recommended for non-CI automated installations?") is no.
