Hello,
I'm trying to use ELMo for longer documents (about 600-900 words on average), with a batch size of 32. I'm using ELMo jointly with a uni-directional LSTM on top, on a GTX 1080 Ti.
from allennlp.commands.elmo import ElmoEmbedder
elmo = ElmoEmbedder(cuda_device=args.gpu)
sent_reps, sent_mask = elmo.batch_to_embeddings(input)
sent_reps = sent_reps.detach()
I'm using batch_to_embeddings() because it returns a PyTorch Variable instead of NumPy arrays, so everything can stay on the GPU.
I'm not looking to retrain ELMo, so I detached the returned variable, and my optimizer only optimizes the parameters of my own model. However, I still get this error: cuda runtime error (2) : out of memory.
I guess my question is: is there a way to reduce memory consumption of ELMo? For example, is there an inference mode that I can use? It would also be helpful if there are memory-reducing tips in general for PyTorch since AllenNLP is built on PyTorch :)
I'm going to guess this is due to the document length. A batch of 32 documents * 600 tokens is an effective batch size of 19,200 tokens.
The easiest workaround is to use a smaller batch size.
You could also split your document into individual sentences, run the biLM over each individual sentence, then concatenate the sentences to get a sequence the length of your document.
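For anyone who wants to try the sentence-splitting route, here is a minimal sketch of the bookkeeping. It assumes the document is already split into tokenized sentences, and embed_batch is a hypothetical stand-in for whatever call returns one list of per-token vectors per sentence (it is not an AllenNLP function). Running a few sentences at a time keeps each forward pass small, and because each sentence's vectors are appended without padding, no document-level mask is needed at this stage.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_document_by_sentence(
    sentences: Sequence[Sequence[str]],
    embed_batch: Callable[[List[List[str]]], List[List[Vector]]],
    max_sentences_per_batch: int = 16,
) -> List[Vector]:
    """Embed a document a few sentences at a time and concatenate the results."""
    document_vectors: List[Vector] = []
    for start in range(0, len(sentences), max_sentences_per_batch):
        chunk = [list(s) for s in sentences[start:start + max_sentences_per_batch]]
        # embed_batch is assumed to return unpadded vectors, one list per sentence.
        for sentence_vectors in embed_batch(chunk):
            document_vectors.extend(sentence_vectors)
    return document_vectors


def fake_embed_batch(batch: List[List[str]]) -> List[List[Vector]]:
    # Placeholder embedder for illustration only: one 2-d vector per token.
    return [[[float(len(token)), 0.0] for token in sentence] for sentence in batch]


doc = [["ELMo", "is", "slow"], ["Split", "long", "documents", "first"]]
vectors = embed_document_by_sentence(doc, fake_embed_batch, max_sentences_per_batch=1)
print(len(vectors))
```

The result has one vector per token of the original document, in document order, so it can be fed to the downstream LSTM like any other sequence.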
Thank you Matthew for the prompt response!
I have considered splitting into multiple sentences, but since everything is "batched", it might be too difficult to deal with masking, or too inefficient to run ELMo on single sentences.
Would methods like requires_grad=False or volatile=True be helpful in reducing memory footprint?
We are setting requires_grad=False now but not volatile. That may help, not sure.
Other options: is it a single (or a few) very long documents that are causing the OOM? If so, you can truncate them to, say, 1000 tokens. Or does it run out of memory on all documents?
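A rough sketch of the truncation idea, assuming documents are already tokenized into lists of strings. The helper names are made up for illustration. The padded_tokens function shows why a single long document matters: every document in a batch is padded to the longest one, so the batch costs batch size times maximum length.

```python
from typing import List


def truncate_documents(documents: List[List[str]], max_tokens: int = 1000) -> List[List[str]]:
    """Keep at most max_tokens tokens from the start of each document."""
    return [doc[:max_tokens] for doc in documents]


def padded_tokens(documents: List[List[str]]) -> int:
    """Number of token slots a batch occupies once padded to its longest document."""
    if not documents:
        return 0
    return len(documents) * max(len(doc) for doc in documents)


docs = [["tok"] * 600, ["tok"] * 700, ["tok"] * 5000]
truncated = truncate_documents(docs, max_tokens=1000)

print(padded_tokens(docs))       # padded size before truncation
print(padded_tokens(truncated))  # padded size after truncation
```

Truncation throws away the tail of long documents, so whether 1000 tokens is an acceptable cutoff depends on the task.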
The model seems to be able to run through a few iterations (fewer than 100). Truncation is probably the best idea :) I tried lowering the batch size to 16 or 8, but that was not very successful.
After placing ELMo on its own GPU, and my model on a different GPU, it definitely seems like a long document is causing the issue. I think I'm going to go with truncation :) Thank you!
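If truncation loses too much, one alternative (not something suggested in this thread, just a common trick) is to cap the padded size of each batch instead of the number of documents. The sketch below sorts documents by length and starts a new batch whenever adding the next document would push batch size times longest length over a budget. The function name and the budget value are assumptions for illustration.

```python
from typing import List, Sequence


def batch_by_token_budget(documents: Sequence[Sequence[str]], max_padded_tokens: int) -> List[List[int]]:
    """Group document indices so that len(batch) * longest_doc stays within the budget."""
    # Sort indices by document length so similar lengths end up together.
    order = sorted(range(len(documents)), key=lambda i: len(documents[i]))
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Lengths are ascending, so the incoming document is the longest in the batch.
        longest = len(documents[idx])
        if current and (len(current) + 1) * longest > max_padded_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [100, 900, 120, 3000, 110]
docs = [["tok"] * n for n in lengths]
batches = batch_by_token_budget(docs, max_padded_tokens=1000)
print(batches)
```

A document that exceeds the budget on its own still gets a batch to itself, so this limits padding waste but does not by itself guarantee that a very long document fits in memory.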
Christopher Clark ran the SQuAD experiments so I'm not familiar with all the details. I do know that he:
(1) used a GPU with 12 GB of RAM
(2) lowered the batch size with ELMo
(3) pre-computed the context insensitive token representations and only ran the biLSTM layers during training.
I think (3) was important to getting it to work. This involves running the character CNN-highway portion of the network once for all tokens in the vocab, then removing that portion during inference, replacing it with a static word embedding and just running the biLM. allennlp doesn't support this mode yet, so it would require some modifications to get it to work...
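To make (3) concrete, here is a toy sketch of the caching pattern only. It is not the AllenNLP or SQuAD implementation: fake_char_encoder is a placeholder for the expensive character CNN-highway computation, and the out-of-vocabulary fallback vector is an assumption. The point is that the expensive encoder runs once per vocabulary entry, and every later lookup is a cheap table read.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def build_token_cache(vocab: Sequence[str], encoder: Callable[[str], Vector]) -> Dict[str, Vector]:
    """Run the expensive context-insensitive encoder once per vocabulary token."""
    return {token: encoder(token) for token in vocab}


def lookup(tokens: Sequence[str], cache: Dict[str, Vector], unk: Vector) -> List[Vector]:
    """Replace each token with its cached vector; unknown tokens get the fallback."""
    return [cache.get(token, unk) for token in tokens]


encoder_calls: List[str] = []


def fake_char_encoder(token: str) -> Vector:
    # Placeholder for the character CNN-highway layers (illustration only).
    encoder_calls.append(token)
    return [float(len(token)), float(sum(map(ord, token)) % 7)]


vocab = ["the", "cat", "sat"]
cache = build_token_cache(vocab, fake_char_encoder)

# Many lookups later, the encoder has still only run once per vocab token.
for _ in range(100):
    vectors = lookup(["the", "cat", "sat", "the", "dog"], cache, unk=[0.0, 0.0])

print(len(encoder_calls))
```

The cached vectors would then be fed to the biLSTM layers, which still have to run over every sequence because their output depends on context.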
Thank you! This is very helpful!
I definitely think the ability to pre-compute the context insensitive token representations should be made public and more accessible! I now realize that ELMo runs a bit too slowly...
When I use traditional word embeddings, it only takes 15 seconds per 100 iterations.
Now with ELMo it takes about 4 minutes per 100 iterations :)
@matt-peters would you happen to know if there's an example of doing this somewhere? I'm also running into memory issues due to long sequences. I imagine this is a rather common situation.
The character CNN caching has been implemented; use the vocab_to_cache option: https://github.com/allenai/allennlp/blob/master/allennlp/modules/elmo.py#L63
Just in case anyone else stumbles upon this issue: precomputing the context insensitive token representations is also implemented in bilm-tf, ELMo's TensorFlow implementation. Usage is documented in the README and in the script usage_token.py.
What would you suggest as a good optimizer when using ELMo representations in a bidirectional setup?