Flair: transformer models for language model training and tag prediction instead of LSTMs

Created on 15 Aug 2018 · 26 Comments · Source: flairNLP/flair

I recently read OpenAI's generative pre-training paper.
According to the benchmarks, fine-tuning the OpenAI model on a custom dataset takes much less time than an LSTM-based approach.
The model has also been shown to improve the SOTA on a lot of tasks.
So I was wondering if it is possible to replace the pipeline with a transformer-based model as implemented by OpenAI.

Labels: feature, help wanted, wontfix

Most helpful comment

Hi guys, I've made some updates and a new release for this stuff: https://github.com/huggingface/pytorch-pretrained-BERT/releases/tag/v0.5.1

Keep up the good work on flair.

All 26 comments

Great idea - we've been discussing this internally and really want to try it out, and compare the two approaches! Any help / pointers are appreciated :)

https://github.com/huggingface/pytorch-openai-transformer-lm has an implementation of the transformer model in PyTorch, plus scripts to load the OpenAI transformer weights.
I'll have a look at it this weekend and check the feasibility of the implementation.

Great, thanks! Perhaps this code can be the basis of new transformer-based LanguageModel and LanguageModelTrainer classes!
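To make the idea a bit more concrete, here is a very rough, hypothetical sketch of what the core of such a transformer-based language model could look like (the class name, signatures and the use of PyTorch's built-in nn.TransformerEncoder are all assumptions, not flair's actual API):

```python
import torch
import torch.nn as nn

class TransformerLanguageModel(nn.Module):
    """Hypothetical sketch of a transformer LM that a LanguageModelTrainer
    could train with a next-token prediction objective."""

    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 12, max_len: int = 512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor):
        # input_ids: (seq_len, batch), as nn.TransformerEncoder expects by default
        seq_len = input_ids.size(0)
        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(1)
        hidden = self.token_embedding(input_ids) + self.position_embedding(positions)

        # causal mask so each position only attends to earlier positions
        mask = torch.triu(torch.full((seq_len, seq_len), float('-inf'),
                                     device=input_ids.device), diagonal=1)
        hidden = self.encoder(hidden, mask=mask)

        logits = self.decoder(hidden)   # next-token scores, (seq_len, batch, vocab)
        return logits, hidden           # hidden states could double as embeddings
```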

A deep Transformer model now also achieves state-of-the-art results in language modeling, see this paper. So I think integrating such an architecture into flair would be awesome :heart:

But don't look at the evaluation section in the paper mentioned above ;) it took more than 7 days on a single Cloud TPU :scream:

64 layers, wow...
I don't think implementing such a huge network would be feasible, since it would slow down the training of further models in the pipeline quite considerably. However, their 12-layer network also yielded some decent results.
The concept of auxiliary losses is a good one; I will have to test it and see how it works out.

Small update: We are going to add BERT embeddings (see https://github.com/zalandoresearch/flair/issues/251) to flair in the next release. They are based on transformers.

We are still thinking of adding our own transformer model at some point, but not in the near future.
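For reference, once released, usage would presumably mirror the other embedding classes in flair; a minimal sketch, assuming the class ends up as BertEmbeddings in flair.embeddings and accepts a BERT model identifier:

```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings  # assumed name, based on issue #251

# load transformer-based BERT embeddings (the model identifier is an assumption)
embedding = BertEmbeddings('bert-base-uncased')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)

# every token now carries a contextual embedding vector
for token in sentence:
    print(token.text, token.embedding.shape)
```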

alright :+1:

@alanakbik and @tabergma: Here's another great paper about a Transformer-based LM:

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

-> Yesterday they provided both a TensorFlow and a PyTorch implementation of the model. I'm going to play with the implementation now; maybe I'll find a way to get embeddings for a sentence (like it is done with FlairEmbeddings).

Wow this looks really interesting!

Two PRs from the pytorch-pretrained-BERT repository are very interesting:

Once they're merged I would like to add them to flair :)

Training a Transformer-XL model is possible, but on one GPU I had to use a smaller Transformer model (I'm currently doing some experiments with it...)

Yeah that would be great! :) Also, we'd be very interested to hear about your experiments with Transformer-XL!

Version 0.5.0 is out now: https://github.com/huggingface/pytorch-pretrained-BERT/releases/tag/v0.5.0

I'll check the integration of OpenAI GPT and the Transformer-XL now :)

Wow awesome!

Wow this is awesome. Really look forward to transformer-based models and fine-tuning-based models.

Two current caveats:

  • OpenAI GPT needs two additional libraries (not covered by pytorch-pretrained-BERT's dependency management): ftfy and spacy. For spacy you also need to manually install the English model with python -m spacy download en. After that it works fine; I was able to get embeddings for a sentence (see the sketch after this list).
  • Transformer-XL: I wasn't able to get proper embeddings; a "nan" tensor was returned. But I opened an issue, see here :)
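A minimal sketch of the first point, getting GPT representations for a sentence via pytorch-pretrained-BERT (the example sentence is illustrative, and the exact return values may differ between versions):

```python
# prerequisites: pip install ftfy spacy && python -m spacy download en
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()

tokens = tokenizer.tokenize('Berlin is the capital of Germany .')
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    hidden_states = model(input_ids)  # last layer, (batch, seq_len, hidden_size)

print(hidden_states.shape)
```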

Ah, thanks for the update - do you know why OpenAI GPT requires spacy, and why the English model? Only for tokenization?

Hi guys, I've made some updates and a new release for this stuff: https://github.com/huggingface/pytorch-pretrained-BERT/releases/tag/v0.5.1

Keep up the good work on flair.

I've implemented an early draft of TransformerXLEmbeddings and I'm currently training on the CoNLL 2003 dataset. I'll report the results here soon :)
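Roughly, pulling token representations out of the pretrained Transformer-XL with pytorch-pretrained-BERT looks like this (a sketch; return values may vary between versions):

```python
import torch
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
model.eval()

tokens = tokenizer.tokenize('Berlin and Munich are cities in Germany')
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # the model returns the last hidden states plus the memory ("mems") cells
    last_hidden, mems = model(input_ids)  # last_hidden: (batch, seq_len, d_model)

# one vector per token -- a TransformerXLEmbeddings class would attach these
# to flair Token objects, e.g. via token.set_embedding(...)
print(last_hidden.shape)
```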

Btw: the second version of GPT is out: https://github.com/openai/gpt-2/blob/master/README.md

@stefan-it In my understanding, TransformerXLEmbeddings supports variable sentence lengths, so it won't run into the out-of-index issue of BertEmbeddings, because BERT has a fixed maximum length of 512 tokens. Is that correct?

@stefan-it @thomwolf wow that's great - really looking forward to seeing this in action! And very interested to hear how well it does on CoNLL 03 and other tasks.

Here's another Transformer-based architecture that uses a new approach to pretraining (a cloze-style token reconstruction task is embedded during training):

https://arxiv.org/abs/1903.07785

It also achieves a new SOTA on CoNLL-2003 NER: 93.5% (compared to flair's 93.18%)

Very impressive results - look forward to taking a closer look at this!

One major drawback is the ridiculous amount of training data :rofl: Unfortunately, there's currently no implementation/model available.

I just asked @michaelauli if they plan to release the code and model :) [I could imagine that it will be integrated into fairseq, but this is just speculation]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
