I'm planning on talking about this in more detail when we have our summit in a couple of months, but I figured I might as well write down my thoughts here first, so that anyone who wants can comment on them (especially people who aren't going to be at the summit).
We've struggled a lot with how to manage contributions to the library, and with having a huge pile of dependencies that come from lots of individual models that the rest of the library doesn't need. We don't really want to add a large maintenance burden by accepting a whole lot of other people's models, but we _do_ want to highlight and give visibility to good code that others write.
I think we can solve these issues by splitting this repository into several sub-repositories, and adding an add-plugin command that adds a package that gets loaded at every call to import allennlp.
The add-plugin command works by adding a line for the package to .allennlp/plugins, which gets checked at the end of allennlp.__init__.py, doing an import(package) on everything listed there. This makes it so you don't have to add --include-package at every call to an allennlp command, so you can get the registered models / dataset readers / etc. easier. We might need something tricky to make sure you can add commands this way, but it shouldn't be too hard. We could also make add-plugin do a pip install on the package itself, if it hasn't already been done.
I'm imagining something like the following repositories:
Trainer, most of data except dataset_readers; all of the abstract classes you need to run a trainer, and most or all of the commands)allennlp.modules, though some or even all of it could stay in the main repo - I could go either way on this one)serve and configure commands)On the main README, we describe this architecture and point to a list of plugins / model collections, including ones written by other people (which gives us a way to accept large contributions without adding to our maintenance burden - tell them to make their repo pip installable, and add it to the list of plugins).
allennlp-seq2seq repo.Thanks a lot for this project! It really cut down amount of boilerplate I had to write for experiments.
My feedback:
allennlp-all or, alternatively, just keep allennlp as the name of the umbrella package to avoid confusion from the existing users.Hmm, yeah, that's not a bad idea - use allennlp-core if you just want the bare-bones stuff, and allennlp includes a default set of plugins.
None of this should change any of the tutorials or other things - the things that you import in those tutorials are the things that would stay in the allennlp-core package, without modifying any code. The models and dataset readers would probably move to be in different packages, unless we did some crazy python magic.
We're also working on an allennlp "course" with more and better introductory material. No concrete timeline on when it'll be released yet, other than that we've committed internally to release an initial version by the end of September at the latest.
My 2 cents:
I don't think having an add-plugin command is the best way to manage registerable things across packages. A couple of alternatives:
allennlp/__init__.py which try to import the packages.post_install script in each of the individual packages which updates some .allennlp/plugins as you suggestSecondly, I don't think we need to commit to splitting out everything - I think the semparse and state_machines packages (and associated dataset_readers and models) would be great candidates to split out as a first step into allennlp-semparse. We should try this out and see how it goes before going all in and splitting out things like modules. This is a great first package to split out, because 32% of our slowest 40 tests are from the semantic parsing modules or packages.
I like the idea of the plugin making it easy to use other people's repos if they make them pip installable.
@matt-gardner, thanks for outlining your thoughts here. While I like the idea of having user models outside the main repo and accessible via add-plugin (or similar) I'm worried about fragmenting the core library. Essentially I think your concerns section is spot on. Maintenance across repos can be a big pain. While it's natural for different research projects to have their own repos (semantic parsing, salience, etc.) you seem to be suggesting we go further.
To be more specific, allennlp.modules is broadly applicable. I'm struggling to think why it wouldn't be in the core library. There are certainly dataset readers that we are meant to be used across projects too like InterleavingDatasetReader, MultiprocessDatasetReader, SequenceTaggingDatasetReader, etc.
@DeNeutoy 's point about splitting out some candidates first sounds like precisely the right way to start addressing this issue.
Research projects already have their own repos, but we still have all kinds of models in the main library. The semantic parsing code isn't a research project, it's a sub-library for doing research projects. My main point is that we should probably not have _any_ models in the main repository - this simplifies dependencies and gives clarity to "should we accept this model"? The answer is always "no, it belongs somewhere else."
The server is another place where we add a lot of dependencies, which many users probably don't need.
To be clear, these are the modules that I would keep in allennlp-core:
commands (except configure, which requires a server)commondata, except data.dataset_readers and some of the niche Fields (we keep the base DatasetReader, and possibly the Interleaving and Multiprocess readers, but not the tagging reader, which goes in an allennlp-tagging repo)modules (there are some niche ones that can be moved to an appropriate sub-repo, but most can stay)nn, minus beam_search.py and chu_liu_edmonds.py, which go to the seq2seq and parsing reposteststools, but most get movedtrainingpretrained.py gets moved into an allennlp-pretrained repo that depends on allennlp-core and a bunch of the other repos (and is included in the main allennlp pip package).
Another benefit of doing things this way: I've said a few times in the past the some parts of the API to me feel more important to keep stable than other parts. Splitting things out makes clear which things we are committing to keeping stable - the APIs in allennlp-core won't change, but things in other repos might not have as firm commitments to backwards compatibility. This way no one has to guess about whether the deprecation policy applies to any particular change.
Not having any models in the main repository seems strange to me. As a user trying to understand a library the first thing I do is look for examples of the abstractions in use. Having the actual models be somewhere else would be confusing and slow me down. This is not user friendly. Granted, I'm happy to move specialized models elsewhere, but having example models and suitably general ones (like seq2seq) in the main library strikes me as useful. Having clarity to the "should we accept this model" question is nice, but we give up much for that clarity. Instead we could simply say, "We generally don't accept new models. Exceptions are made at our discretion for models that are suitably general and useful."
Thank you for describing allennlp-core in more detail! I think I broadly understand your logic. My concerns at present aren't so much that I disagree with the fine-grained decisions of what to include/exclude, but rather that -- for a project of such size -- I'd like be careful with the process we use. In particular:
1) We should describe and evaluate requirements that we're attempting to satisfy. For instance, "Speed up our CI to run faster than n minutes".
2) We should consider alternative solutions for those requirements. "Put allennlp-teamcity on a bigger box."
3) We should carefully weigh the costs and benefits of the proposed solutions.
I think "suitably general and useful" is impossible to define and boils down to research taste. E.g., I can think of more papers that used our semantic parsing framework than used our seq2seq model. I don't think we should be in the business of trying to define what "suitably general and useful" is, especially because it's easy enough to separate things out so that we don't have to.
The goals of this are the following:
Registrable things, without having to use a ton of --include-packages everywhere.Faster CI is a side benefit, not a goal. For getting to examples, it's probably _easier_ to have them separated out, because then it's more obvious how you would do development on your own models. And we would have obvious links to them, and a course describing what the abstractions are, how they work, and how to use them to build models for a wide variety of tasks.
My claim is precisely that we need not define "suitably general and useful". We're not legislating or trying to write a program that can make this decision. Using discretion does not seem inappropriate here. If your view is that semantic parsing is more broadly useful than seq2seq, I'm quite happy to defer to your expertise and keep it in the main library! My concern is not that initially separating things out is hard. It's that maintaining and developing a library across many repos and many CIs is hard.
Organizations much larger than ours successfully use monorepos. I think it's worth thinking of this issue along a spectrum. A true monorepo lies on one side and a fine-grained partition of our current codebase into many different repos lies on the other. Somewhere between is the option of moving out the more esoteric models (or those with egregious dependencies) while keeping the rest.
I guess the difference is in what we view as "the library". I think "the library" should refer to the abstractions and the pieces that let you build models, not the models themselves. It makes a lot of things clearer and simpler.
Yep, that does appear to be the key difference. Thanks for taking the time to outline your thoughts, Matt. Looking forward to discussing this in person.
Thanks for pushing on them; it helps to clarify things and get us to end up in a better place overall.
Great discussion. Some thoughts:
I concur that multiple repos for one project can be hard to manage. As I understand, the main issue is too many heavyweight dependencies for allennlp package. So I'd entertain the idea of releasing multiple Python packages from the same repo.
In TeamCity it is possible to setup CI build to be triggered on changes in specific paths:
https://www.jetbrains.com/help/teamcity/configuring-vcs-triggers.html#ConfiguringVCSTriggers-VCSTriggerRules
So that can help with building only packages that were changed.
Regarding external contributions, I saw several projects have contrib directory or repo where contributed code lives. That code usually has less support guarantees than the project proper.
Closing in favor of #3302.
Most helpful comment
My 2 cents:
I don't think having an
add-plugincommand is the best way to manage registerable things across packages. A couple of alternatives:allennlp/__init__.pywhich try to import the packages.post_installscript in each of the individual packages which updates some.allennlp/pluginsas you suggestSecondly, I don't think we need to commit to splitting out everything - I think the
semparseandstate_machinespackages (and associated dataset_readers and models) would be great candidates to split out as a first step intoallennlp-semparse. We should try this out and see how it goes before going all in and splitting out things like modules. This is a great first package to split out, because 32% of our slowest 40 tests are from the semantic parsing modules or packages.I like the idea of the plugin making it easy to use other people's repos if they make them pip installable.