Allennlp: Split AllenNLP into pieces

Created on 24 Jun 2019 · 13Comments · Source: allenai/allennlp

I'm planning on talking about this in more detail when we have our summit in a couple of months, but I figured I might as well write down my thoughts here first, so that anyone who wants can comment on them (especially people who aren't going to be at the summit).

We've struggled a lot with how to manage contributions to the library, and with having a huge pile of dependencies that come from lots of individual models that the rest of the library doesn't need. We don't really want to add a large maintenance burden by accepting a whole lot of other people's models, but we _do_ want to highlight and give visibility to good code that others write.

I think we can solve these issues by splitting this repository into several sub-repositories, and adding an add-plugin command that adds a package that gets loaded at every call to import allennlp.

The add-plugin command works by adding a line for the package to .allennlp/plugins, which gets checked at the end of allennlp.__init__.py, doing an import(package) on everything listed there. This makes it so you don't have to add --include-package at every call to an allennlp command, so you can get the registered models / dataset readers / etc. easier. We might need something tricky to make sure you can add commands this way, but it shouldn't be too hard. We could also make add-plugin do a pip install on the package itself, if it hasn't already been done.

I'm imagining something like the following repositories:

allennlp (contains the core abstractions - Trainer, most of data except dataset_readers; all of the abstract classes you need to run a trainer, and most or all of the commands)
allennlp-modules (contains most of what's in allennlp.modules, though some or even all of it could stay in the main repo - I could go either way on this one)
allennlp-server (contains the serve and configure commands)
allennlp-reading-comprehension (contains the models, dataset readers, and utilities related to reading comprehension)
allennlp-seq2seq (contains seq2seq models, including what's in #2097 and #2913)
allennlp-semparse (contains the state machine and semparse modules, and related models and dataset readers)
and so on (possibly including a vision-specific plugin, as in #2099)

On the main README, we describe this architecture and point to a list of plugins / model collections, including ones written by other people (which gives us a way to accept large contributions without adding to our maintenance burden - tell them to make their repo pip installable, and add it to the list of plugins).

Additional benefits

CI should get much faster for each individual piece here
We would be dogfooding the contribution process that we recommend others take
Each repo could have its own standards for formatting / CI
Splitting things out like this gives a natural time to add an auto-formatter, if that's what we want
(Minor, but non-trivial) My IDE is a lot snappier when the package I'm developing in is not so huge; splitting allennlp up would reduce not just cognitive load when looking at the library, but also IDE load.
It's a lot easier / less scary for us to add maintainers to the individual repositories. E.g., @sai-prasanna, who's done a lot of work on the seq2seq abstractions, could be a maintainer of the new allennlp-seq2seq repo.

Concerns

It's harder for CI to check for the implications of changes to the main library on downstream code, because that code is in another repository. We could mitigate this by having CI for the main repo do some integration testing with other repos, if we wanted to.
Releases and compatibility between these repos become more of a pain. Though, if we are strongly committed to keeping the API stable, this shouldn't be _too_ much of a problem - most of the plugin repos probably won't need releases much at all.

Source

matt-gardner

👍10

Most helpful comment

My 2 cents:

I don't think having an add-plugin command is the best way to manage registerable things across packages. A couple of alternatives:

Try except blocks in allennlp/__init__.py which try to import the packages.
A post_install script in each of the individual packages which updates some .allennlp/plugins as you suggest

Secondly, I don't think we need to commit to splitting out everything - I think the semparse and state_machines packages (and associated dataset_readers and models) would be great candidates to split out as a first step into allennlp-semparse. We should try this out and see how it goes before going all in and splitting out things like modules. This is a great first package to split out, because 32% of our slowest 40 tests are from the semantic parsing modules or packages.

I like the idea of the plugin making it easy to use other people's repos if they make them pip installable.

DeNeutoy on 25 Jun 2019

👍2

All 13 comments

Thanks a lot for this project! It really cut down amount of boilerplate I had to write for experiments.

My feedback:

As a library user, I'd still want ease of use and detailed tutorials and examples in one obvious place rather than spread across many repos.
I'd like to have one Python package to install so I don't need to think about picking the right set of dependencies that work with each other. e.g. something like allennlp-all or, alternatively, just keep allennlp as the name of the umbrella package to avoid confusion from the existing users.
Also it would be good if existing tutorials and examples in blog articles kept working. I guess it's not a priority but they helped me to start using AllenNLP.

shtratos on 24 Jun 2019

Hmm, yeah, that's not a bad idea - use allennlp-core if you just want the bare-bones stuff, and allennlp includes a default set of plugins.

None of this should change any of the tutorials or other things - the things that you import in those tutorials are the things that would stay in the allennlp-core package, without modifying any code. The models and dataset readers would probably move to be in different packages, unless we did some crazy python magic.

We're also working on an allennlp "course" with more and better introductory material. No concrete timeline on when it'll be released yet, other than that we've committed internally to release an initial version by the end of September at the latest.

matt-gardner on 24 Jun 2019

👍2

My 2 cents:

I don't think having an add-plugin command is the best way to manage registerable things across packages. A couple of alternatives:

Try except blocks in allennlp/__init__.py which try to import the packages.
A post_install script in each of the individual packages which updates some .allennlp/plugins as you suggest

I like the idea of the plugin making it easy to use other people's repos if they make them pip installable.

DeNeutoy on 25 Jun 2019

👍2

@matt-gardner, thanks for outlining your thoughts here. While I like the idea of having user models outside the main repo and accessible via add-plugin (or similar) I'm worried about fragmenting the core library. Essentially I think your concerns section is spot on. Maintenance across repos can be a big pain. While it's natural for different research projects to have their own repos (semantic parsing, salience, etc.) you seem to be suggesting we go further.

To be more specific, allennlp.modules is broadly applicable. I'm struggling to think why it wouldn't be in the core library. There are certainly dataset readers that we are meant to be used across projects too like InterleavingDatasetReader, MultiprocessDatasetReader, SequenceTaggingDatasetReader, etc.

@DeNeutoy 's point about splitting out some candidates first sounds like precisely the right way to start addressing this issue.

brendan-ai2 on 26 Jun 2019

👍1

Research projects already have their own repos, but we still have all kinds of models in the main library. The semantic parsing code isn't a research project, it's a sub-library for doing research projects. My main point is that we should probably not have _any_ models in the main repository - this simplifies dependencies and gives clarity to "should we accept this model"? The answer is always "no, it belongs somewhere else."

The server is another place where we add a lot of dependencies, which many users probably don't need.

To be clear, these are the modules that I would keep in allennlp-core:

commands (except configure, which requires a server)
common
data, except data.dataset_readers and some of the niche Fields (we keep the base DatasetReader, and possibly the Interleaving and Multiprocess readers, but not the tagging reader, which goes in an allennlp-tagging repo)
most of modules (there are some niche ones that can be moved to an appropriate sub-repo, but most can stay)
nn, minus beam_search.py and chu_liu_edmonds.py, which go to the seq2seq and parsing repos
tests
a few things in tools, but most get moved
training

pretrained.py gets moved into an allennlp-pretrained repo that depends on allennlp-core and a bunch of the other repos (and is included in the main allennlp pip package).

Another benefit of doing things this way: I've said a few times in the past the some parts of the API to me feel more important to keep stable than other parts. Splitting things out makes clear which things we are committing to keeping stable - the APIs in allennlp-core won't change, but things in other repos might not have as firm commitments to backwards compatibility. This way no one has to guess about whether the deprecation policy applies to any particular change.

matt-gardner on 26 Jun 2019

Not having any models in the main repository seems strange to me. As a user trying to understand a library the first thing I do is look for examples of the abstractions in use. Having the actual models be somewhere else would be confusing and slow me down. This is not user friendly. Granted, I'm happy to move specialized models elsewhere, but having example models and suitably general ones (like seq2seq) in the main library strikes me as useful. Having clarity to the "should we accept this model" question is nice, but we give up much for that clarity. Instead we could simply say, "We generally don't accept new models. Exceptions are made at our discretion for models that are suitably general and useful."

Thank you for describing allennlp-core in more detail! I think I broadly understand your logic. My concerns at present aren't so much that I disagree with the fine-grained decisions of what to include/exclude, but rather that -- for a project of such size -- I'd like be careful with the process we use. In particular:

1) We should describe and evaluate requirements that we're attempting to satisfy. For instance, "Speed up our CI to run faster than n minutes".
2) We should consider alternative solutions for those requirements. "Put allennlp-teamcity on a bigger box."
3) We should carefully weigh the costs and benefits of the proposed solutions.

brendan-ai2 on 26 Jun 2019

I think "suitably general and useful" is impossible to define and boils down to research taste. E.g., I can think of more papers that used our semantic parsing framework than used our seq2seq model. I don't think we should be in the business of trying to define what "suitably general and useful" is, especially because it's easy enough to separate things out so that we don't have to.

The goals of this are the following:

Make a light-weight core installation for people who don't need the huge mass of models and other dependencies that we currently have. Lots of people have complained about how difficult it is to install allennlp.
Finally resolve the tension inherent in "contributed models", "model gallery", and so on. We just don't have any "contributed models"; everything (including models that we build) is a pointer to another repository. The core library's job is to define the abstractions and the building blocks, not the buildings themselves.
Have clearer messaging around development and external contributions. With things separated into pieces, it's a lot easier to talk about which parts are stable and always backwards compatible, what the roadmap for each piece is, and which external contributors can be made maintainers.
Make it easier to extend allennlp, e.g., with vision APIs or other Registrable things, without having to use a ton of --include-packages everywhere.
Avoid feature / model creep, and keep the core library focused on providing solid, general abstractions.

Faster CI is a side benefit, not a goal. For getting to examples, it's probably _easier_ to have them separated out, because then it's more obvious how you would do development on your own models. And we would have obvious links to them, and a course describing what the abstractions are, how they work, and how to use them to build models for a wide variety of tasks.

matt-gardner on 27 Jun 2019

My claim is precisely that we need not define "suitably general and useful". We're not legislating or trying to write a program that can make this decision. Using discretion does not seem inappropriate here. If your view is that semantic parsing is more broadly useful than seq2seq, I'm quite happy to defer to your expertise and keep it in the main library! My concern is not that initially separating things out is hard. It's that maintaining and developing a library across many repos and many CIs is hard.

Organizations much larger than ours successfully use monorepos. I think it's worth thinking of this issue along a spectrum. A true monorepo lies on one side and a fine-grained partition of our current codebase into many different repos lies on the other. Somewhere between is the option of moving out the more esoteric models (or those with egregious dependencies) while keeping the rest.

brendan-ai2 on 27 Jun 2019

I guess the difference is in what we view as "the library". I think "the library" should refer to the abstractions and the pieces that let you build models, not the models themselves. It makes a lot of things clearer and simpler.

matt-gardner on 27 Jun 2019

Yep, that does appear to be the key difference. Thanks for taking the time to outline your thoughts, Matt. Looking forward to discussing this in person.

brendan-ai2 on 27 Jun 2019

Thanks for pushing on them; it helps to clarify things and get us to end up in a better place overall.

matt-gardner on 27 Jun 2019

Great discussion. Some thoughts:

I concur that multiple repos for one project can be hard to manage. As I understand, the main issue is too many heavyweight dependencies for allennlp package. So I'd entertain the idea of releasing multiple Python packages from the same repo.

In TeamCity it is possible to setup CI build to be triggered on changes in specific paths:
https://www.jetbrains.com/help/teamcity/configuring-vcs-triggers.html#ConfiguringVCSTriggers-VCSTriggerRules
So that can help with building only packages that were changed.

Regarding external contributions, I saw several projects have contrib directory or repo where contributed code lives. That code usually has less support guarantees than the project proper.

shtratos on 28 Jun 2019

Closing in favor of #3302.

matt-gardner on 3 Jan 2020

Was this page helpful?

0 / 5 - 0 ratings