spaCy: Adding the Universal Language Model Fine-tuning (ULMFiT) pre-trained LM to spaCy and allowing a simple way to train new models

Created on 18 May 2018 · 10 comments · Source: explosion/spaCy

Feature description

Universal Language Model Fine-tuning for Text Classification presents a novel method for fine-tuning a pre-trained universal language model on a particular classification task. It achieves beyond state-of-the-art results (an 18-24% reduction in error rate) on multiple benchmark text classification tasks, and the fine-tuning requires very few labeled examples (as few as 100) to achieve very good results.

Here is an excerpt of the abstract which provides a good TL;DR of the paper (duh):

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We open-source our pretrained models and code.
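For intuition, here is a minimal sketch (plain PyTorch, not the paper's or fastai's actual code) of two of the fine-tuning techniques the paper introduces: discriminative learning rates and gradual unfreezing. The model, layer groups, and learning rates below are purely illustrative.

```python
import torch
import torch.nn as nn

# Stand-in "layer groups": embeddings, an encoder layer, and a task head.
model = nn.Sequential(
    nn.Embedding(10000, 400),   # group 0: embeddings
    nn.Linear(400, 400),        # group 1: stand-in for the LSTM stack
    nn.Linear(400, 2),          # group 2: task-specific classifier head
)
groups = list(model)

# Discriminative learning rates: each earlier group gets lr / 2.6,
# following the factor suggested in the paper.
base_lr = 1e-3
param_groups = [
    {"params": g.parameters(), "lr": base_lr / (2.6 ** (len(groups) - 1 - i))}
    for i, g in enumerate(groups)
]
optimizer = torch.optim.Adam(param_groups)

# Gradual unfreezing: stage 0 trains only the head; each later stage
# unfreezes one more group from the top.
for stage in range(len(groups)):
    for i, g in enumerate(groups):
        trainable = i >= len(groups) - 1 - stage
        for p in g.parameters():
            p.requires_grad_(trainable)
    # ... run one epoch of fine-tuning here ...
```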

I propose that spaCy adds their pre-trained models and a simple way to fine-tune them to a new task as a core feature of the library.

Could the feature be a custom component or spaCy plugin?

If so, we will tag it as a project idea so other users can take it on.

This seems like a core feature of spaCy, greatly increasing its industrial potential. I would argue for making it a first-class citizen if the authors and licensing of this work permit that.

enhancement

All 10 comments

This! 👍

Like it.

Author here. I'd love to see this happen and I'm sure @jph00 would also be on board. Fast.ai is working on pre-trained models for other languages and we'll be working to simplify and make the code more robust.

For sure - I discussed the basic idea of LM fine-tuning with @honnibal recently. I'd be happy to improve integration between fastai's language modeling, our forthcoming model zoo, and spaCy. Our model zoo should work fine with anything based on PyTorch - working with thinc would require porting the architecture and weights, of course.

(Note that this would also require porting the various regularization approaches in AWD-LSTM to thinc, since they're critical to this approach.)
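To make that porting effort concrete, here is a minimal sketch (plain PyTorch, not fastai's actual implementation) of the DropConnect-style weight dropout AWD-LSTM applies to the recurrent weights - one of the regularizers that would need a thinc equivalent. The class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDrop(nn.Module):
    """Re-sample dropout on a named weight of the wrapped module each forward pass."""
    def __init__(self, module, weight_name="weight_hh_l0", p=0.5):
        super().__init__()
        self.module, self.weight_name, self.p = module, weight_name, p
        raw = getattr(module, weight_name)
        # Keep the real parameter under a "_raw" name and drop the original,
        # so a dropped-out copy can be assigned on every forward pass.
        del module._parameters[weight_name]
        module.register_parameter(weight_name + "_raw", nn.Parameter(raw.data))

    def forward(self, *args):
        raw = getattr(self.module, self.weight_name + "_raw")
        dropped = F.dropout(raw, p=self.p, training=self.training)
        setattr(self.module, self.weight_name, dropped)
        return self.module(*args)

lstm = WeightDrop(nn.LSTM(100, 100), p=0.5)
output, _ = lstm(torch.randn(10, 1, 100))   # (seq_len, batch, features)
```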

Would love to do the pre-trained Turkish model!

Super keen on this! @jph00 the vision for plugging in other libraries is to have Thinc as a thin wrapper on top. I've just merged a PR on this, and have fixed up an example of wrapping a BiLSTM model and inserting it into a Thinc model: https://github.com/explosion/thinc/blob/master/examples/pytorch_lstm_tagger.py#L122

You can find the wrapper here: https://github.com/explosion/thinc/blob/master/thinc/extra/wrappers.py#L13

This wrapping approach is the long-standing plan for plugging "foreign" models into spaCy and Prodigy. We want to have similar wrappers for Tensorflow, DyNet, MXNet etc. The Thinc API is pretty minimal, so it's easy to wrap this way.
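As a rough sketch of what that wrapping looks like - assuming PyTorchWrapper (linked above) accepts a torch module and converts numpy arrays to tensors and back; see the linked files for the actual interface - wrapping a small PyTorch encoder might look like this:

```python
import numpy
import torch.nn as nn
from thinc.extra.wrappers import PyTorchWrapper

# Any PyTorch module mapping one array to one array; layer sizes are illustrative.
torch_encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))
model = PyTorchWrapper(torch_encoder)

X = numpy.zeros((8, 300), dtype="f")            # a batch of 8 input vectors
Y, backprop = model.begin_update(X)             # forward pass runs inside PyTorch
dX = backprop(numpy.ones(Y.shape, dtype="f"))   # backward pass through the wrapper
```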

Btw, as well as a plugin, I'm very interested in finding the right solution for pre-training the "embed" and "encode" steps in spaCy's NER, parser, etc. The catch is that our performance target is 10k words per second per CPU core, which I think means we can't use BiLSTM. The CNN architecture I've got is actually pretty good, and we're currently only a little off the target (7.5k words per second in my latest tests).
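For reference, a quick (unofficial) way to check that words-per-second figure for an existing pipeline on your own hardware - the model name and batch size below are just examples, and the model is assumed to be installed:

```python
import time
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["This is a reasonably ordinary sentence for benchmarking ."] * 2000

start = time.time()
n_words = sum(len(doc) for doc in nlp.pipe(texts, batch_size=256))
elapsed = time.time() - start
print(f"{n_words / elapsed:.0f} words per second")
```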

Going from initializing the first layer of our models to pretraining the entire model with hierarchical representations is a must! For additional inspiration, check out "NLP's ImageNet moment has arrived" by The Gradient.

Amazing work @ines @honnibal and all the other contributors, can't wait to give this a shot!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
