The paper mentions only the GLUE datasets. I wonder how it would perform on binary, multiclass, and multilabel text classification tasks, so that we can compare it directly with ELMo and ULMFiT.
A lot of the tasks that we did overlapped with ELMo (NER and MultiNLI in particular). Unfortunately, ULMFiT has no tasks in common with BERT, ELMo, or OpenAI GPT. The other thing that ULMFiT explores that we didn't is in-domain LM adaptation. But we suspect that BERT will do quite well on a task like IMDB, especially with LM adaptation (i.e., running pre-training on the unsupervised IMDB documents for 10,000 steps starting from the BERT model, before fine-tuning).
I haven't had the time and resources yet to fine-tune the LM on IMDB or do a hyperparameter search, but with just 3-4 epochs of classification I get to ~0.93 (with max_seq_length=256 for training, a Colab limitation, and 512 for evaluation).
I pushed my changes for training on IMDB to my fork.
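For anyone unsure what the two sequence lengths mean in practice, here is a rough sketch. It is not the code from the fork: `pad_or_truncate` and the two constants are hypothetical names, the token ids are dummy integers rather than real WordPiece ids, and it ignores the special tokens that BERT adds around each input.

```python
from typing import List

# Hypothetical constants mirroring the settings mentioned above.
TRAIN_MAX_SEQ_LENGTH = 256  # shorter inputs for training (memory limit)
EVAL_MAX_SEQ_LENGTH = 512   # longer inputs for evaluation


def pad_or_truncate(token_ids: List[int], max_seq_length: int, pad_id: int = 0) -> List[int]:
    """Clip a token id list to max_seq_length, then right-pad with pad_id."""
    clipped = token_ids[:max_seq_length]
    return clipped + [pad_id] * (max_seq_length - len(clipped))


# Stand-in for a 400-token review (dummy ids 1..400).
review = list(range(1, 401))

train_ids = pad_or_truncate(review, TRAIN_MAX_SEQ_LENGTH)  # last 144 tokens are dropped
eval_ids = pad_or_truncate(review, EVAL_MAX_SEQ_LENGTH)    # full review kept, then padded
```

With these settings, a long review is cut off during training but seen in full at evaluation time, so the model is evaluated on more context than it was trained on.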