Pants: python deterministic build

Created on 9 Jul 2018 · 16 Comments · Source: pantsbuild/pants

When using python_requirements, Pants retrieves library dependencies when running ./pants test, ./pants binary, etc. It may therefore retrieve different versions of dependencies on different build machines, which introduces uncertainty.

For python_library this is OK; otherwise we may get bitten by mismatched versions of dependencies from different libraries.
For python_binary, it should produce a deterministic build.

Any idea?

All 16 comments

@shuoli84 : Currently there are a few approaches to solving this.

1) Configure pants to point at a frozen source of your 3rdparty deps (if you use git-lfs, you might point at a directory of wheels in your repository) rather than at PyPI. This is a fairly good solution, and is what we do internally.
2) Fully declare your 3rdparty dependencies, and link them to one another so that transitive resolution never kicks in. I'm less sure that this will work well without changes to pex to disable transitive resolution... but that would not be a very hard change.

@stuhood option 2 seems like overkill. That means fully declaring the dependencies of 3rdparty dependencies.
Does option 1 mean that if I want to upgrade one library, I have to upgrade it for all python binaries?

Option 2 can usually be approached incrementally, by gradually pinning the library versions you end up needing. You can absolutely have multiple versions of a single dependency, you just can't have a target transitively depend on more than one of those versions. This would mean that if some target required a different version, you could make it depend on a python_requirement_library() target that has a python requirement pointing to the different version. For example, the main requirement might be tensorflow==1.4.0, potentially in a requirements.txt, but if one of your python targets needed a different version, you could do:

python_requirement_library(
  name='tensorflow-1.3.0',
  requirements=[
    python_requirement('tensorflow==1.3.0'),
  ],
)

and depend on that target somewhere. See the python 3rdparty docs for more info.

@cosmicexplorer The problem is the dependencies tensorflow depends on. Pinning those requires lots of work.

Ah, I misunderstood, sorry. It should still be possible to work incrementally and start pinning versions of dependencies that you need to stop varying as they occur, which makes it much less upfront work (unless you're having issues with that?). If you needed e.g. two versions of tensorflow, I suspect that you would not need to do the whole pinning of all the transitive deps for each version, but that you would only need to change a few that e.g. tensorflow directly interacts with to support e.g. multiple versions. I am not familiar enough with specifically tensorflow (just thought of it randomly) to say whether this is feasible, however.

Ah, I misunderstood, sorry. It should still be possible to work incrementally and start pinning versions of dependencies that you need to stop varying as they occur

As far as I can tell, there is no sane way to know when a transitive-only dep varies. FWICT, the process would go like this:

  1. pin all direct dependencies by editing requirements.txt files slurped up by python_requirements targets as well as any python_requirement_library BUILD targets.
  2. run some pants goal that triggers requirement resolution, in this case binary sounds appropriate, but run with PEX_VERBOSE=5
  3. stare at the output to see which resolved deps are not directly required by the target in question (NB: there is more than staring / grepping to do here! - it's not an easy step).
  4. for each of those, add a Pants python_requirement_library target or line to a requirements.txt that pins to the version pex actually resolved on this run for the ranged requirement.
  5. Commit the modification made in 4. along with the introduction of the python_binary, or else whenever its requirements change.

For the pants binary itself (src/python/pants/bin:pants_local_binary), enum34 is one example of this sort of hidden dep. It's required by both our requests and pyopenssl requirements (as well as our transitive urllib3 requirement from requests).

Using 3 I find a line:

pex: Fetching file:///home/jsirois/.cache/pants/python_cache/requirements/CPython-2.7.15/enum34-1.1.6-py2-none-any.whl

So I could use this to implement step 4 above and include the modified (or new) BUILDs / requirements.txt in a commit introducing the python_binary.
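Steps 3 and 4 could be partly scripted. The following is only a rough sketch: it assumes every resolved distribution shows up as a `pex: Fetching ...whl` line shaped like the one quoted above, which is an assumption drawn from that single example and not a documented pex log format. The helper names `pins_from_pex_log` and `transitive_only` are made up for illustration.

```python
import re

# Matches a wheel file name such as enum34-1.1.6-py2-none-any.whl at the end
# of a log line. Sdists and other artifact types are ignored in this sketch.
WHEEL_RE = re.compile(r"/(?P<name>[^/]+?)-(?P<version>\d[^-/]*)-[^/]+\.whl\s*$")


def pins_from_pex_log(log_lines):
    """Return sorted 'name==version' pins for every wheel fetched in the log.

    Assumes lines of the form 'pex: Fetching <url>/<wheel file name>'.
    """
    pins = {}
    for line in log_lines:
        if not line.startswith("pex: Fetching "):
            continue
        match = WHEEL_RE.search(line)
        if match:
            pins[match.group("name")] = match.group("version")
    return ["{}=={}".format(name, version) for name, version in sorted(pins.items())]


def transitive_only(pins, direct_names):
    """Drop pins whose project name is already a direct requirement."""
    direct = {name.lower() for name in direct_names}
    return [pin for pin in pins if pin.split("==")[0].lower() not in direct]
```

Feeding the captured PEX_VERBOSE=5 output through `pins_from_pex_log` and then `transitive_only` would give candidate lines for a requirements.txt, but the result still needs the manual review that step 3 warns about.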

It seems to me this is all fairly unreasonable.

What would be more reasonable is for the Pants tasks that produce python binaries or their spiritual equivalent (PythonRun, which only operates on python_binary targets; PytestRun, which acts on a virtual test binary for a given set of targets; and PythonRepl, for a python_binary target) to emit some form of lock file that automatically records the exact versions of all dependencies just resolved. This file would then be checked in and Pants requirement resolution would learn to respect these files when present. One candidate would be a constraints file output in the same directory the python_binary target is declared in, perhaps named (<target_name>.)constraints.txt where <target_name>. could be omitted for the default target. This seems like it would work well for python_binary targets, which have a natural home in the directory of the declaring BUILD file. For tests, which can glob many targets, this is problematic and I have no bright ideas given the current state of non-individual test-target requirement resolution. That, though, is fairly unambiguously a bug (see #5406 and #4723) in v1 and will be fixed by design in v2, so maybe this idea flies.
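The naming scheme could look roughly like this. It is a sketch of the proposal only, not existing Pants behaviour; the function name `constraints_path` is hypothetical, and "default target" is taken to mean a target named after its directory.

```python
import posixpath


def constraints_path(build_dir, target_name):
    """Return where the proposed constraints file for a target would live.

    build_dir is the directory of the declaring BUILD file, relative to the
    repo root and using '/' separators as in Pants target addresses.
    """
    default_name = posixpath.basename(posixpath.normpath(build_dir))
    if target_name == default_name:
        # Default target: the '<target_name>.' prefix is omitted.
        return posixpath.join(build_dir, "constraints.txt")
    return posixpath.join(build_dir, "{}.constraints.txt".format(target_name))
```

For src/python/pants/bin:pants_local_binary this would yield src/python/pants/bin/pants_local_binary.constraints.txt.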

To be clear, the proposal here results in the workflow:

  1. No need to pin any direct requirements
  2. Whenever a python_binary or python_tests target is "built", check in any new or modified *constraints.txt files.

The other detail to this proposal would be the embedding of an input-requirements hash in the generated *constraints.txt file such that it was not re-generated unless the input hash changed - thus providing stable resolves for any "binary" upon first check-in until such time as direct requirements were changed.
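One way the input-requirements hash might work, as a minimal sketch. The header format `# input-requirements-hash:` and all function names here are assumptions made up for illustration; nothing like this is specified in the proposal beyond "embed a hash".

```python
import hashlib

# Hypothetical header line; the proposal does not define a concrete format.
HASH_PREFIX = "# input-requirements-hash: "


def requirements_hash(requirements):
    """Hash the direct requirement strings, ignoring order and blank entries."""
    normalized = sorted(req.strip() for req in requirements if req.strip())
    return hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()


def render_constraints(requirements, resolved_pins):
    """Render a constraints file whose first line records the input hash."""
    lines = [HASH_PREFIX + requirements_hash(requirements)]
    lines.extend(sorted(resolved_pins))
    return "\n".join(lines) + "\n"


def needs_regeneration(requirements, existing_text):
    """True when no constraints file exists or its recorded hash is stale."""
    if not existing_text:
        return True
    first_line = existing_text.splitlines()[0]
    return first_line != HASH_PREFIX + requirements_hash(requirements)
```

With this shape, a resolve is only redone when the direct requirements change; a newer transitive release on PyPI alone leaves the checked-in file untouched.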

Another option would involve only supporting pinning via the artifact cache. This would force us to implement artifact caching for the python pipeline (which we don't today) and only users who enable a remote artifact cache would benefit. Several problems here though:

  1. The artifact cache fundamentally is transient. It saves work when it can, but it can be blown away, LRU'd, etc. This would wipe out the pinned resolve.
  2. A python binary that gained or lost platforms would invalidate a pinned resolve when it probably shouldn't. We'd likely instead want the versions to stay pinned as-is, with just the platform-specific requirements added to or removed from the resolve at the already established versions. Some handwave here, since the non-existence of a platform-specific requirement probably affects version range solutions.

I think 1 absolutely rules this out, but even if we surmounted these problems the checked-in lock/constraints file approach provides visibility that seems desirable. It's also expected in non-Pants workflows, so familiar-ish.

I do like the lockfile solution, which is a common pattern in other languages. Pipenv also embraces this.

Especially given that there is an existing constraints file format, from the context I have this would seem to be a huge productivity boost, especially if we incorporate the input hash and the (target.)constraints.txt naming scheme you described as default (i would assume the target could specify a path to some constraints file as well? perhaps not, actually). I am reviewing the two PRs you linked to get a better understanding of the resolution concern for test targets.

I’m not sure I understand the issue with test targets and producing constraints files for them — you mentioned “globbing many targets,” but I don’t understand what that means / why we can’t do the same naming schema described above (unless you mean that that many files could become cumbersome? a valid point). One way to somewhat ease that might be to allow a constraints file that applies to all targets in the directory — either with a special filename, or something else. This file would then be processed and updated to form a set of constraints that applies for all targets in the directory — and will fail if any have requirement versions that conflict with others. At the risk of getting too ahead of myself, this could also be merged/overridden with target.constraints.txt files for individual targets? Again, might be misunderstanding the issue here.
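To make the directory-wide idea a bit more concrete, here is a rough sketch, assuming constraints are plain `name==version` lines. The function names are hypothetical, and whether per-target files should override the directory-wide file is exactly the open question above; the sketch simply picks "per-target wins".

```python
def parse_pins(lines):
    """Parse 'name==version' lines into a dict, skipping comments and blanks."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins


def merge_directory_constraints(per_target):
    """Merge per-target pins into one directory-wide set, failing on conflict.

    per_target maps a target name to its {project: version} pins.
    """
    merged = {}
    owners = {}
    for target, pins in sorted(per_target.items()):
        for name, version in pins.items():
            if name in merged and merged[name] != version:
                raise ValueError(
                    "{} is pinned to {} by {} but to {} by {}".format(
                        name, merged[name], owners[name], version, target
                    )
                )
            merged[name] = version
            owners[name] = target
    return merged


def effective_constraints(directory_pins, target_pins):
    """Overlay a target's own pins on top of the directory-wide pins."""
    result = dict(directory_pins)
    result.update(target_pins)
    return result
```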

edit: oh, it seems you were referring to the separate test source/requirement isolation issue. I will read over again, but also the above may be a half answer to that as well.

I also think the analogy to other languages’ lockfiles makes a lot of sense @shuoli84 — I was not familiar with this pattern in python but am looking through the PEP now.

@Eric-Arellano mentioned https://docs.pipenv.org/ after i brought up this (interesting, maybe irrelevant) blog post i found on the slack: https://tech.instacart.com/freezing-pythons-dependency-hell-in-2018-f1076d625241. i will look into whether pipenv could be used similarly to constraints files discussed by @jsirois above, or whether it is overkill.

I have a slightly different idea in mind. It would work for my use case, but I am not sure if it would break other assumptions. I am happy to hear what you think:

  • python_requirement_library is extended with a constraints list. Whenever a Python test or binary target is created that references the python_requirement_library then constraints are passed to pex via its existing --constraints parameter. This ensures only versions allowed by the constraints can end up getting installed.
  • Extend python_requirements so that it populates the constraints fields of all created python_requirement_library objects with the _entire_ list of requirements as found in the parsed requirements.txt. This implies that when a requirements.txt file is loaded the specified versions also act as constraints for anything originating from the same requirements.txt. If needed, this behaviour can be guarded with a feature toggle.

This would enable the following workflow:

  • I can maintain a file with unpinned requirements requirements.in, eg:

    
    
  • I can use a tool such as the pip-compile command from the pip-tools package to generate a fully-pinned requirements.txt:

    #
    # This file is autogenerated by pip-compile
    # To update, run: 
    #
    #    pip-compile --output-file requirements.txt requirements.in
    #
    backports.functools-lru-cache==1.5  # via matplotlib
    cycler==0.10.0            # via matplotlib
    kiwisolver==1.0.1         # via matplotlib
    matplotlib==2.2.3
    numpy==1.15.3             # via matplotlib
    pyparsing==2.2.2          # via matplotlib
    python-dateutil==2.7.4    # via matplotlib
    pytz==2018.6              # via matplotlib
    six==1.11.0               # via cycler, matplotlib, python-dateutil
    subprocess32==3.5.3       # via matplotlib
    
  • The requirements.txt can be loaded as usual via python_requirements()

  • When I now reference 3rdparty/python:matplotlib as a dependency, it still has the entire pinned requirements.txt attached as a context. This will be passed to pex as a constraints file, thus ensuring I always get a compatible numpy version and nothing else.
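A minimal sketch of what the extended python_requirements could compute, assuming a fully pinned requirements.txt like the one above. `RequirementLibrary` is a hypothetical stand-in for a python_requirement_library carrying the proposed constraints field; it is not the real Pants target type.

```python
from collections import namedtuple

# Hypothetical model of a python_requirement_library with the proposed
# `constraints` field; not the actual Pants implementation.
RequirementLibrary = namedtuple(
    "RequirementLibrary", ["name", "requirement", "constraints"]
)


def libraries_from_requirements(lines):
    """Build one library per pinned line, each carrying ALL pins as constraints."""
    pins = []
    for line in lines:
        requirement = line.split("#", 1)[0].strip()  # drop '# via ...' comments
        if requirement:
            pins.append(requirement)
    constraints = tuple(pins)
    libraries = {}
    for pin in pins:
        name = pin.split("==", 1)[0]
        libraries[name] = RequirementLibrary(name, pin, constraints)
    return libraries
```

Depending on the matplotlib library would then bring along the numpy, six, etc. pins as constraints, which is the list the proposal would hand to pex.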

@StephanErb slightly different than the idea at the bottom of https://github.com/pantsbuild/pants/issues/6077#issuecomment-403924782 ? It sounds like you and I are on the same page, but if you're intending to highlight a difference from that idea, perhaps spell it out a bit louder for me - I'm getting old and hard of hearing.

@jsirois yeah I think we are mostly on the same page.

What would be more reasonable is for the Pants tasks that produce python binaries or their spiritual equivalent () to emit some form of lock file that records exact version of all dependencies just resolved automatically. This file would then be checked in and Pants requirement resolution would learn to respect these files when present.

This would enable very fine-grained per-target locking. In my proposal the lockfile is something more global. A combination of both ideas sounds feasible though.

To be clear, the proposal here results in the workflow:

  1. No need to pin any direct requirements
  2. Whenever a python_binary or python_tests target is "built", check in any new or modified *constraints.txt files.

I am wondering how and when those constraints would be updated. I have a use case in mind where I want to update pandas and numpy globally but keep everything else at its current fixed version (e.g. prevent requests from being updated as well). With the global constraints file I could easily ensure those two are updated in lockstep with a single commit to a single file. With a more distributed per-target approach I might have to touch hundreds of constraints files. If there is an automated invalidation at play, it could also happen that package versions are updated that I did not intend to touch.
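The single-file update could be as small as the following sketch, assuming the global constraints file consists of `name==version` lines with optional trailing comments. The function name `bump_pins` is made up for illustration.

```python
def bump_pins(lines, updates):
    """Return the constraint lines with only the named projects re-pinned.

    updates maps a project name to its new version. Every other line is
    passed through untouched, so unrelated packages cannot drift.
    """
    remaining = {name.lower(): version for name, version in updates.items()}
    result = []
    for line in lines:
        body, sep, comment = line.partition("#")
        name, eq, _ = body.strip().partition("==")
        key = name.strip().lower()
        if eq and key in remaining:
            new_line = "{}=={}".format(name.strip(), remaining.pop(key))
            if sep:
                new_line += "  #" + comment  # keep any '# via ...' comment
            result.append(new_line)
        else:
            result.append(line)
    if remaining:
        # Refuse to silently ignore a package that is not pinned yet.
        raise KeyError("not pinned in constraints: {}".format(sorted(remaining)))
    return result
```

Passing only pandas and numpy in `updates` changes exactly those two lines in one commit, which is the lockstep behaviour described above.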

Some comments about the approach @StephanErb is suggesting:

  1. requirements.in should also be able to contain partly pinned requirements. E.g. you may want to blacklist certain versions (like pandas!=0.22) if you know that there are bugs, security issues or license violations in a particular version. Furthermore, you may know that the next release of a package (e.g. numpy) is not compatible with your codebase, so you can write things like numpy<1.5.

  2. I would strongly suggest finding another tool than pip-tools for pinning. pip-tools hooks into non-public pip APIs, has a bunch of bugs regarding extras and uses a greedy algorithm that needs a lot of guidance in requirements.in to not take the wrong turn (it's basically constraint solving w/o backtracking). I have experience w/ a large code base that uses requirements.in + pip-tools and I'm not happy about it; the UX of the pinning in particular is really bad. I've briefly tested poetry and it seems that their constraint solver is quite good, also because it fetches wheels from a PyPI-like mirror on-demand and tries to cache a lot of things (so subsequent re-pins might be very fast). Basically, a large issue w/ many python package managers is the constraint/dependency solver. So if you want to do yourself and your users a favor, try to find a well designed system w/ a backtracking solver.
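To illustrate what "backtracking" buys here, the following is a toy resolver, not how pip-tools or poetry actually work. Packages, versions and the in-memory `index` format are all invented for the example; allowed versions are given as explicit sets instead of real version specifiers to keep it short.

```python
def resolve(pending, index, chosen=None):
    """Return {name: version} satisfying every constraint, or None.

    pending: list of (name, allowed_versions) constraints still to satisfy.
    index:   name -> list of (version, dependencies), newest first, where
             each dependency is itself a (name, allowed_versions) pair.
    """
    chosen = dict(chosen or {})
    if not pending:
        return chosen
    (name, allowed), rest = pending[0], pending[1:]
    if name in chosen:
        # Already decided: the earlier choice must satisfy this constraint too.
        return resolve(rest, index, chosen) if chosen[name] in allowed else None
    for version, deps in index.get(name, []):
        if version not in allowed:
            continue
        attempt = dict(chosen)
        attempt[name] = version
        result = resolve(list(deps) + rest, index, attempt)
        if result is not None:
            return result
        # Dead end: fall through and try the next older version (backtrack).
    return None


# Invented example: the newest "app" needs core 1.5, but "legacy" needs 1.4.
INDEX = {
    "app": [("2.0", [("core", {"1.5"})]), ("1.0", [("core", {"1.4", "1.5"})])],
    "legacy": [("1.0", [("core", {"1.4"})])],
    "core": [("1.5", []), ("1.4", [])],
}
```

A greedy resolver would take app 2.0 and core 1.5 and then fail on legacy. The backtracking version above retreats to app 1.0 and core 1.4, which satisfies all three.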

This is now supported through a lockfile with the V2 implementation of Python. See https://pants.readme.io/docs/python-third-party-dependencies#using-a-lockfile-recommended.
