Allennlp: Trying to compare Elmo and OpenAi transformer embeddings

Created on 10 Nov 2018 · 10 Comments · Source: allenai/allennlp

This is not a bug report. I am experimenting with the pretrained OpenAI transformer embedder and trying to compare it with ELMo on a task. However, the OpenAI embedder is too slow for me to experiment on my dataset. Has anyone on the team seen something similar?

To use the OpenAI transformer, I took the indexer and embedder configs (almost) directly from (here). For ELMo I am not doing any caching.

Time for 1 epoch on a small dataset of 10K instances:

Plain Glove  :  15 sec
Elmo         :  72 sec
OpenAi       :  869 sec        (more than 10 times slower than ELMo)

All other config parameters, such as batch size, are the same in all three cases.

Is this expected? If so, is there any way to make it faster?

HarshTrivedi

All 10 comments

What's the task? What's the size of your input? The OpenAI transformer does self-attention over its input, and that can get expensive for long inputs (from a different conversation, I think it might even always pad everything to its max input length, which seems a bit silly). For much smaller inputs, it wouldn't surprise me at all for ELMo to be much faster, if the OpenAI code is indeed doing all of that padding. If it's not, I'd still expect ELMo to be faster if you're close to the max length for the transformer.

matt-gardner on 11 Nov 2018
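
To get a feel for how much work fixed-length padding could add, here is a minimal sketch that counts the positions a model would process under two padding schemes. It assumes the embedder really does pad to a fixed length, as suspected above. The helper name padded_positions and the sequence lengths are made up for illustration and are not part of AllenNLP.

```python
from typing import List

def padded_positions(lengths: List[int], batch_size: int, fixed_length: int = 0) -> int:
    """Count positions processed when each batch is padded.

    With fixed_length > 0 every sequence is padded to that length; otherwise
    each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_length if fixed_length > 0 else max(batch)
        total += target * len(batch)
    return total

# Made-up sequence lengths, roughly the size of short sentence pairs.
lengths = [12, 9, 15, 20, 8, 11, 14, 10]
dynamic = padded_positions(lengths, batch_size=4)
fixed = padded_positions(lengths, batch_size=4, fixed_length=512)
print(f"padded to batch max: {dynamic} positions, padded to 512: {fixed} positions")
```

For short inputs the fixed-length count is dominated by padding, so most of the compute would go to positions that carry no text.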

@matt-gardner The task is textual entailment, and this particular run used only the SNLI dev set (10K sentence pairs). In the worst case I have max num_tokens set to 200 in the reader, but I don't think inputs reach that, since SNLI sentences are generally short. For this run the batch size was only 16 sentence pairs, which limits how much padding there can be.

Do you think the difference can still be this drastic?

HarshTrivedi on 11 Nov 2018

Which optimizer are you using? The standard Adam optimizer?
Are you using AllenNLP's openai_transformer_embedder directly, or a modified version?

tingkai-zhang on 11 Nov 2018

@tingkai-zhang I used openai_transformer_embedder. The indexer and embedder configs are taken almost directly from (here). The only difference is that the embedding dim for tokens is 300 instead of 50, and the embeddings are loaded from GloVe.

The optimizer is Adam, but I don't think it should make any difference. I was only trying to estimate the difference in computation time for 1 epoch.

Have you used openai_transformer_embedder for any task? If so, did you experience something similar?

HarshTrivedi on 11 Nov 2018

The original OpenAI implementation has a custom optimizer with learning rate warmup.

I use openai_transformer_embedder for both entailment and classification tasks.

I was wondering how you get the sentence representation when using openai_transformer_embedder as a token embedder.

tingkai-zhang on 12 Nov 2018

For the model to use OpenAI embeddings for the tokens in the sentence, you only need to make the appropriate changes in the config file (e.g. link).

HarshTrivedi on 12 Nov 2018

The OpenAI model is significantly slower than ELMo because it is substantially larger. It pads all sequences to n_ctx length, so you can speed it up by decreasing the n_ctx parameter to the maximum length in your dataset, if that is significantly lower than the 512 default.

matt-peters on 12 Nov 2018
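
One way to pick a value for n_ctx along these lines is to measure the longest input in the dataset first. The sketch below is only an illustration: suggest_n_ctx and its to_pieces argument are hypothetical names, not AllenNLP functions, and the whitespace splitter is a stand-in. It assumes n_ctx counts byte-pair pieces rather than words, so a real run would plug in the byte-pair tokenization used by the indexer.

```python
from typing import Callable, Iterable, List

def suggest_n_ctx(texts: Iterable[str],
                  to_pieces: Callable[[str], List[str]],
                  margin: int = 0,
                  default_n_ctx: int = 512) -> int:
    """Return the smallest context length that fits every text, capped at the default."""
    longest = max((len(to_pieces(text)) for text in texts), default=0)
    return min(longest + margin, default_n_ctx)

# Stand-in splitter: whitespace tokens. Byte-pair pieces are usually more
# numerous than words, so this undercounts for a real model.
sentences = ["A man plays a guitar .", "Two dogs run across the field ."]
n_ctx = suggest_n_ctx(sentences, str.split)
print(f"suggested n_ctx: {n_ctx}")
```

The margin argument leaves room for any extra positions the model may need per sequence. Texts longer than the cap would still be truncated, exactly as with the default setting.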

@matt-peters Thanks for your suggestion! In my case the max length is significantly lower. I will update n_ctx accordingly and see how much speedup I get.

HarshTrivedi on 12 Nov 2018

@HarshTrivedi did you see a speedup by reducing the max length? How much?

matt-gardner on 17 Nov 2018

@matt-gardner Sorry, forgot to report the numbers back! :(

Here are the 1-epoch times on 10K SNLI instances (train and dev are the same set) with the OpenAI embedder:

n_ctx = 80        :  124 sec
n_ctx = 100       :  152 sec
n_ctx = 200       :  294 sec
n_ctx = 300       :  472 sec
n_ctx = 400       :  650 sec
n_ctx = 512       :  869 sec

Thank you for helping with this!

HarshTrivedi on 17 Nov 2018
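
To make the trend in those numbers easier to read, the short script below computes the speedup over the 512 default and the time per context position. It uses only the timings reported above; the variable names are made up for illustration.

```python
# 1-epoch times in seconds, keyed by n_ctx, as reported in this thread.
timings = {80: 124, 100: 152, 200: 294, 300: 472, 400: 650, 512: 869}

baseline = timings[512]
speedups = {n_ctx: baseline / seconds for n_ctx, seconds in timings.items()}
per_position = {n_ctx: seconds / n_ctx for n_ctx, seconds in timings.items()}

for n_ctx in sorted(timings):
    print(f"n_ctx={n_ctx:>3}  {timings[n_ctx]:>3} sec  "
          f"{speedups[n_ctx]:4.1f}x vs 512  "
          f"{per_position[n_ctx]:.2f} sec per position")
```

Going from 512 to 80 gives about a 7x speedup, and the time per position stays between roughly 1.5 and 1.7 seconds, so on this dataset the epoch time grows close to linearly with n_ctx. Even at n_ctx = 80, the 124 sec is still above the 72 sec reported for ELMo.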
