dvc: consider switching from GitPython

Created on 2 Jul 2019 · 30 Comments · Source: iterative/dvc

GitPython causes constant headaches and considerable time lost investigating and fixing issues on Windows, where files are not closed and processes are not terminated soon enough. We have several places in our code where we were forced to use retries to handle that.
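For readers unfamiliar with the retry workaround mentioned above, here is a minimal sketch of the general pattern. It is not dvc's actual code: the `retry` decorator and its parameters are hypothetical names, and it assumes the failure surfaces as a `PermissionError` (as it typically does on Windows when another process still holds a file open).

```python
import time
from functools import wraps


def retry(times=5, delay=0.1, exceptions=(PermissionError,)):
    """Retry the wrapped callable up to `times` attempts (hypothetical helper)."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Give up on the last attempt and let the error propagate.
                    if attempt == times:
                        raise
                    # Otherwise wait a little for the other process to let go.
                    time.sleep(delay)

        return wrapper

    return decorator
```

A call site would wrap something like `shutil.rmtree` on a freshly used repo directory with `@retry(...)`. The cost is that genuine permission problems are only reported after all attempts are exhausted.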

The notable alternatives are pygit2 and Dulwich.

Does anyone know of any downsides of any of these? Or other alternatives? Or maybe there is a good reason we are using GitPython, which I am unaware of?

c13-half-a-week p3-nice-to-have refactoring

All 30 comments

Also, looks like gitpython depends on git being installed:

GitPython needs the git executable to be installed on the system and available in your PATH for most operations. If it is not in your PATH, you can help GitPython find it by setting the GIT_PYTHON_GIT_EXECUTABLE=<path/to/git> environment variable.
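As an illustration of that requirement, a tool could check up front whether a git executable is reachable before importing GitPython at all. This is only a sketch: `find_git_executable` is a hypothetical helper, and it approximates the lookup described in the quote (explicit `GIT_PYTHON_GIT_EXECUTABLE` override first, then `PATH`) rather than reproducing GitPython's own logic.

```python
import os
import shutil


def find_git_executable(env=None):
    """Return a path to git, or None if none can be found (hypothetical helper)."""
    env = os.environ if env is None else env
    # An explicit override wins, mirroring the documented environment variable.
    override = env.get("GIT_PYTHON_GIT_EXECUTABLE")
    if override:
        return override
    # Otherwise fall back to a plain PATH lookup.
    return shutil.which("git", path=env.get("PATH"))
```

If this returns `None`, commands that do not strictly need git could in principle skip the GitPython import and fail with a clearer message only where git is really required.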

Dulwich is pure-python. I wonder if we can do git clone with it as well? It would be great if dvc does not require git at all in certain cases like get/import-url.

We had some trouble with Dulwich trying to compile its C extension or something. @efiop has a better idea of the details and even prepared a patch for Dulwich. I'm not sure why we are not just specifying the global option to compile it without the C extension.

Dulwich doesn't have wheels yet, and to install it in pure-python form we need to explicitly pass a special flag to pip install, which cannot be included in install_requires. That is why I sent a patch that would fall back to pure-python if it is unable to compile the C extensions, but that patch got declined because the maintainer didn't want people accidentally installing the slower dulwich version. So I've helped a bit with automatic wheel building, but there is still stuff to do (e.g. need to fix tests on mac). Other than that, dulwich seems pretty promising.

pygit2 requires libgit2, but it has a pretty good wheel selection already https://pypi.org/project/pygit2/#files , so also might be promising.

We chose to use gitpython because it just worked and is the most popular project among these :) With more advanced use it does have a lot of drawbacks indeed.

I really enjoy using the pygit2 interface.

I am also in favor of pygit2. It's maintained by the same people maintaining libgit2, so it should be future proof and not do anything weird.

It looks like it's getting more and more important. A few people mentioned partial checkout problems with gitpython. And it's very important to solve get/dvc API dependency on command line and local filesystem. @efiop can we prioritize the initial research around pygit2?

Sure @shcheklein , done.

We might reconsider using pygit2, there are lots of issues with it:

  • it's a beta
  • not actively developed
  • issues get answers in a week or several weeks, many times they don't
  • it's lower level than gitpython, see this for example
  • unlike gitpython, it does not mirror git calls: if someone is able to do something with git ... on the command line, it might not work when dvc does the same via pygit2. E.g. cloning with an https or ssh URL might work from the command line but not via pygit2, depending on the libraries installed or some configuration

GitPython for comparison:

  • not actively developed
  • issues get triaged fast, not fixed - "help wanted" label is assigned
  • PRs get merged regularly

P.S. Dulwich is somewhere in between.

@Suor Dulwich - is there active development, at least? Can you confirm that it does not depend on the command line? If we decide to get involved in one of those projects, we need to pick one that is more or less active, so that we can be sure we'll be able to release new versions, build wheels, etc. And it should be aligned with our long-term goals - things like no dependency on the CLI are important, for example.

@shcheklein Dulwich is not actively developed, none of the three is. Support is also so-so, I would not call it reliable.

Why do you want to get away from the CLI? I see using it as an advantage: it makes things more reliable and easier to set up for users, since they have already done that. The only downsides I see are speed (presumably, and we need benchmarks to really say this is an issue) and processes on Windows, which we have worked around so far.

Because it's extremely strange to have git installed as a dependency for an API to work. I've seen already some push back on this.

I see. But for everything else it's an advantage.

If we go the pygit2 way, we will need to own/develop/release it; I see zero chance it will work any other way. We will also need to provide custom wheels that include libssh and OpenSSL and their counterparts on other systems to make it work with https and ssh URLs, otherwise users will need to install those manually. We will need to add lots of code to dvc to support many different use cases, and we will get lots of bugs too.

If we go the Dulwich way, we will most probably also need to support it. If it turns out to be slow - it is mostly pure python - then we will need to rewrite parts of it in C or Cython; some are already.

k, how about we use both then? For the API we need some limited functionality - only fetching the bundle of files that belong to a revision. Can we do this somehow with some other library?

I guess with some hack we may use pygit2 or Dulwich just for API.

@Suor what kind of hacks do you see? What else do we need from them to start using them? How stable is the "clone" part in them?

@shcheklein looked through the code and it looks like simply changing calls in external_repo() implementation to clone and checkout with different lib would do.

@Suor is there a function/command in Git to only checkout, I wonder? Anyway, sounds like a good workaround then. But it looks like we will need to find a way to support new stuff in GitPython. So, let's create a ticket for fixing GitPython + a ticket for fixing Dulwich and using it in the API.

@shcheklein what do you mean by "only checkout", isn't that git checkout?

@mroutis just download the workspace that corresponds to a revision as a tar bundle for example. Don't download .git, etc. Use simple http protocol, etc. I'm not sure if git server supports stuff like this, but it would be handy for our API.

@shcheklein looks like it does, but GitHub doesn't allow that transaction https://twitter.com/GitHubHelp/status/322818593748303873

https://www.gilesorr.com/blog/git-archive-github.html
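The git feature being discussed here is `git archive --remote`, which asks the server for a tarball of one revision without transferring `.git`. A minimal sketch of how it could be driven from Python follows; `archive_command` and `fetch_revision` are hypothetical names, the approach still needs the git executable, and (as noted above) GitHub rejects this request, so it only helps with servers that allow it.

```python
import subprocess


def archive_command(remote, rev, fmt="tar"):
    """Build the `git archive` command line for one revision of a remote repo."""
    return ["git", "archive", f"--remote={remote}", f"--format={fmt}", rev]


def fetch_revision(remote, rev, out_path):
    """Write the workspace at `rev` to `out_path` as a tar file.

    Raises subprocess.CalledProcessError if the server refuses the request.
    """
    with open(out_path, "wb") as out:
        subprocess.run(archive_command(remote, rev), stdout=out, check=True)
```

The resulting archive can be unpacked with the standard `tarfile` module, so no clone or checkout step is involved on the client side.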

So there is an issue with switching the API to Dulwich: we would really be switching the external repo implementation, which is also used by dvc import/update, so we would need to install Dulwich for everyone with all its issues, e.g. not having wheels.

We might use pygit2 instead, but that is not able to do clone(depth=1) at the moment. In fact it's not supported by libgit2 itself; this has been hanging there for years.

So the diagnosis on this: neither Dulwich nor pygit2 is ready to be used instead of GitPython, the reasons being lacking support and functionality. Using either of them for API/get only is complicated by the fact that the same code path is used by dvc import, so we will always require them.

Functionality around API/get/import that is lacking:

  • pygit2:

    • clone(depth=1) - not supported by libgit2, a feature request is hanging for years and blocked by no support for shallow repos at all
    • can't list tags - there is a workaround, which doesn't work itself
    • https and ssh require extra libs
    • auth requires extra setup
  • Dulwich

    • checkout implementation is incomplete: it doesn't remove files. Could be worked around by cloning without checkout and then checking out into an empty dir.
    • doesn't have wheels for any platform
    • breaks on install unless --pure flag is used, which can't be used in setup requires, but only in requirements.txt, it also prevents pip from using wheels for everything else

Since we have a shared code path between the dvc cli and the api, all dvc cli users will need to install Dulwich/pygit2 if we switch that path to either of them. Both have installation issues.

My recommendation: Use Dulwich for API/get/import only, set up wheels for it, change our code after that, use Dulwich as simple dependency along with GitPython.

Long-term takeaway. Any future transition to Dulwich/pygit2 will need to consider that:

  • Dulwich

    • not actively developed,
    • support is triage/ask for help/merge PRs,
    • base is low-level, high-level part is incomplete
  • pygit2

    • not actively developed,
    • support is basically absent,
    • lower level than GitPython,
    • lacking both high and low level functionality
  • libgit2

    • developed very slowly,
    • some support,
    • it's C

In my opinion the only way we can use any of those instead of GitPython is by maintaining it.

@Suor , @shcheklein what about git from the command line, is this too much? For the SCM tree it makes sense to use a library, but what about other operations like clone that are more "high level"?

@Suor , what do we need from the library itself?

  • Traverse the .git directory (for SCM tree?)
  • add
  • commit
  • clone
  • Ignore files from tracking

Am I missing something?
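One way to make that list concrete is to write it down as an abstract interface and check candidate libraries against it. The sketch below is purely illustrative: `GitBackend`, its method names and signatures, and `missing_operations` are all hypothetical and do not correspond to dvc's SCM classes.

```python
from abc import ABC, abstractmethod


class GitBackend(ABC):
    """Operations dvc would need from any git library (hypothetical interface)."""

    @abstractmethod
    def clone(self, url, path, rev=None):
        """Clone `url` into `path`, optionally checking out `rev`."""

    @abstractmethod
    def add(self, paths):
        """Stage the given paths."""

    @abstractmethod
    def commit(self, message):
        """Commit the staged changes."""

    @abstractmethod
    def ignore(self, path):
        """Exclude `path` from tracking, e.g. via .gitignore."""

    @abstractmethod
    def walk_tree(self, rev):
        """Iterate over the files of `rev` without touching the workspace."""


def missing_operations(backend_cls):
    """List the required operations a candidate backend does not implement yet."""
    return sorted(backend_cls.__abstractmethods__)
```

A GitPython-, Dulwich- or pygit2-based subclass could then be swapped in per code path, and `missing_operations` gives a quick view of which gaps (such as the checkout and shallow-clone issues above) remain for each candidate.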

Agreed on using dulwich for get, to clone and check out (with a workaround) the tree, so we need to set up macos wheel builds (need to make tests pass on mac first).

  • [x] fix mac tests
  • [x] set up wheel building for mac (much of it was already submitted by me earlier, but the guy just hasn't released anything since February)
  • [ ] use dulwich to clone and checkout in dvc get

Also might want to discuss ^ with dulwich maintainer(note for myself).

dulwich now ships with all the wheels https://pypi.org/project/dulwich/#files , we could give it another try now.

A combination of pygit2 + dulwich to compensate each other's issues/lack of features could be possible too.

Sounds like a new chapter in our git adventures and a bumpy one ;)
