Allennlp: Caching CNNs should result in ELMo speed up

Created on 25 Jul 2018 · 14 Comments · Source: allenai/allennlp

Hi,
I have a real-time application that uses the allennlp bidaf model as one of its components. At run time the bidaf reader has to read around three passages (each with 10-15 sentences). These passages have to be read separately, not combined. I use the predict_batch_json method for this, but it seems slow: it takes around 5s to read and answer each passage, so around 15s for all three, which is too slow for a real-time application.
Is this the expected speed? Is there any way I can make it faster, such as using a GPU during prediction?

kaushalshetty

Most helpful comment

@kaushalshetty There is a way to cache about 35% of the cost of elmo on the CPU - it's a little complicated and will require retraining your model.

Basically, the idea is that the first layer of elmo is just CNN character convs - these are independent for each word. So if you have a vocabulary which you know will make up 90% of your input words, you can run this once, and just do a word embedding lookup in the case that a batch contains only words that you have pre-cached.
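
The idea can be sketched in a few lines of plain Python. This is a toy illustration, not AllenNLP's implementation: `toy_char_encoder` is a hypothetical stand-in for the character CNN, and `CachedWordEncoder` is a made-up name. It only shows the control flow described above: precompute per-word vectors for a known vocabulary, use the lookup when every word in the batch is cached, and otherwise recompute.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def toy_char_encoder(word: str) -> Vector:
    """Hypothetical stand-in for the char CNN: depends only on the word's characters."""
    codes = [ord(c) for c in word]
    return [float(sum(codes)), float(max(codes, default=0)), float(len(codes))]


class CachedWordEncoder:
    """Serve per-word vectors from a lookup table when the whole batch is cached."""

    def __init__(self, encoder: Callable[[str], Vector], vocabulary: Sequence[str]) -> None:
        self.encoder = encoder
        # Run the (expensive) encoder once per vocabulary word, up front.
        self.cache: Dict[str, Vector] = {word: encoder(word) for word in vocabulary}
        self.cache_hits = 0  # batches answered purely from the lookup table
        self.fallbacks = 0   # batches that had to be recomputed

    def encode_batch(self, batch: Sequence[Sequence[str]]) -> List[List[Vector]]:
        if all(word in self.cache for sentence in batch for word in sentence):
            self.cache_hits += 1
            return [[self.cache[word] for word in sentence] for sentence in batch]
        # At least one unseen word: recompute the whole batch with the encoder.
        self.fallbacks += 1
        return [[self.encoder(word) for word in sentence] for sentence in batch]
```

The whole-batch fallback mirrors the condition stated above (the lookup is used only when a batch contains nothing but pre-cached words), so the benefit depends on how often real inputs stay inside the cached vocabulary.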

In order to do this, you will need to do the following to your config:

{
  "dataset_reader": {
    "type": "squad",
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      },
      "elmo": {
        "type": "elmo_characters"
      },
      "token_characters": {
        "type": "characters",
        "character_tokenizer": {
          "byte_encoding": "utf-8",
          "start_tokens": [259],
          "end_tokens": [260]
        }
      }
    }
  },
  "train_data_path": "....",
  "validation_data_path": "....",
  "model": {
    "type": "bidaf",
    "text_field_embedder": {
      "tokens": {
        "type": "embedding",
        "pretrained_file": "....",
        "embedding_dim": 100,
        "trainable": false
      },
      "token_characters": {
        "type": "character_encoding",
        "embedding": {
          "num_embeddings": 262,
          "embedding_dim": 16
        },
        "encoder": {
          "type": "cnn",
          "embedding_dim": 16,
          "num_filters": 100,
          "ngram_filter_sizes": [5]
        },
        "dropout": 0.2
      },
      "elmo": {
        "type": "elmo_token_embedder",
        "dropout": 0.2,
        "options_file": "....",
        "weight_file": "...",
        "do_layer_norm": false,
        // IMPORTANT: Specifies which namespace to cache.
        "namespace_to_cache": "tokens"
      },
      // IMPORTANT: Specifies which token embedders use which indexers.
      // See that now ELMo gets passed both characters AND tokens. This
      // means that if all the tokens are present in the cached char embedding,
      // it will use that instead.
      "embedder_to_indexer_map": {
        "tokens": ["tokens"],
        "token_characters": ["token_characters"],
        "elmo": ["elmo", "tokens"]
      }
    },
    // ... remaining "model" settings unchanged from bidaf.json ...
  },
  // ... remaining top-level settings unchanged from bidaf.json ...
}

Let me know if you can get that working.

DeNeutoy on 25 Jul 2018

👍2

All 14 comments

I just checked our demo model, and it doesn't look like it's using ELMo, so I wouldn't expect it to be this slow. You said you're running it on a CPU? I'd recommend running it on the GPU if you want it to be faster. Also, predict_batch_json should be running the three predictions in one big tensor on the GPU; if you're doing it right, it shouldn't triple the runtime. How are you running the predictions? If you wrote your own script, it's possible you've written it such that you're actually loading a new model each time.

matt-gardner on 25 Jul 2018

Thanks for the reply @matt-gardner
No, we changed the config file to use ELMo embeddings, since they give me better accuracy.

Yes, I ran the code containing the predict_batch_json method on a CPU, but following your suggestion I will run it on a GPU. I am sure it will help.
Here is the code:
from allennlp.predictors.bidaf import BidafPredictor
bidaf_predictor = BidafPredictor.from_path("/apps1/common/allennlp/trained_models/Trained_BGdata_Elmo_model/model.tar.gz")
inputs1 = [{"passage": doc, "question": question} for doc in doc_content_list]
reader_output_2 = bidaf_predictor.predict_batch_json(inputs1)

In the above snippet, a single question is queried against multiple passages stored in doc_content_list.
Please let me know if I am doing anything wrong.

kaushalshetty on 25 Jul 2018

It looks like your code is right. Just be sure you're not running the BidafPredictor.from_path() line multiple times. But, yeah, if you're running ELMo on a CPU, I'd expect it to be very slow. A GPU will help a lot, but it'll still be a lot slower than without ELMo; we need to do some more work to make prediction faster when using a language model to embed your inputs.
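
One way to make sure the archive is loaded only once in a long-running service is to wrap the loading call in a small holder. The sketch below is generic Python rather than anything from AllenNLP; `LoadOnce` and `load_count` are hypothetical names, and the loader is passed in as a callable so the pattern can be tried without the model.

```python
from typing import Any, Callable, Optional


class LoadOnce:
    """Call an expensive loader at most once and reuse its result afterwards."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self.load_count = 0  # how many times the loader actually ran

    def get(self) -> Any:
        if self._model is None:
            # First request: pay the loading cost here.
            self._model = self._loader()
            self.load_count += 1
        # Later requests reuse the object that is already in memory.
        return self._model
```

With the snippet from this thread, the loader would be something like `lambda: BidafPredictor.from_path(archive_path)` (where `archive_path` is a placeholder for your model.tar.gz), and each request would call `holder.get().predict_batch_json(inputs1)`.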

matt-gardner on 25 Jul 2018

👍1

Got it. Thanks a lot. Closing this issue.

kaushalshetty on 25 Jul 2018

Re-opening until we see a speedup from caching char-cnn embeddings.

DeNeutoy on 25 Jul 2018

So what you are saying is that I pass all my sentences through the CNN character convs (ELMo's first layer), assuming that those sentences cover 90% of my vocabulary, and then create a word embedding lookup that serves as a cache. That's something I can try.
But I do have some questions on this:
Am I right in assuming that the word embedding lookup will hold the output of the first layer of character CNN convs?
How can the ELMo language model capture semantics if each word is processed independently of the others?

kaushalshetty on 26 Jul 2018

The only part that is cached is the character convolutions, which are independently computed for each word, which allows the representation formed for common words to be cached. This is only the first layer of elmo - the two LSTM layers allow contextual representations to be learnt. To be clear, what I am suggesting is identical to ELMo, just with a part of it pre-cached for frequent words. If you modify the relevant parts of the bidaf.json config which I posted above, this should work pretty smoothly.
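
A toy example may make the distinction concrete. Nothing here is ELMo code: `char_layer` is a hypothetical stand-in for the character convolutions and `contextual_layer` is a deliberately crude stand-in for the LSTM layers (a left-to-right running mean). The point is only that the first function sees one word at a time, so its output can be cached, while the second mixes in the surrounding words.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def char_layer(word: str) -> Vector:
    """Context-independent: the output depends only on the word's own characters."""
    return [float(sum(ord(c) for c in word)), float(len(word))]


def contextual_layer(vectors: Sequence[Vector]) -> List[Vector]:
    """Context-dependent toy layer: running mean of all vectors seen so far."""
    if not vectors:
        return []
    running = [0.0] * len(vectors[0])
    outputs: List[Vector] = []
    for position, vector in enumerate(vectors, start=1):
        running = [r + v for r, v in zip(running, vector)]
        outputs.append([r / position for r in running])
    return outputs


def embed(sentence: Sequence[str]) -> Tuple[List[Vector], List[Vector]]:
    """Return (first-layer vectors, contextual vectors) for one sentence."""
    first = [char_layer(word) for word in sentence]
    return first, contextual_layer(first)
```

For the sentences `["river", "bank"]` and `["savings", "bank"]`, the first-layer vector for "bank" is identical in both, while its contextual vector differs because the preceding word differs. Only the first kind of output is safe to precompute per word.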

DeNeutoy on 26 Jul 2018

Clear now. Will try the above code and let you know what I get.

kaushalshetty on 26 Jul 2018

👍1

Re: GPU vs CPU--here's a (rather old) data point: https://github.com/allenai/allennlp/pull/384

schmmd on 26 Jul 2018

👍1

@schmmd @matt-gardner @DeNeutoy Thanks guys. I ran the same piece of code on a GPU and yes, I got a speedup. It reduced the time to read three passages from 11s (on CPU) to 3s (on GPU). That's a huge difference for me.

But I would also like to try caching the char CNN embeddings as @DeNeutoy suggested and see if I can reduce the time further.

kaushalshetty on 26 Jul 2018

@kaushalshetty no problem, thanks for reporting back your numbers! We look forward to hearing how your next experiment works out.

schmmd on 26 Jul 2018

How can I access trained MRC models from the allennlp library?

Prav2018 on 1 Aug 2018

Closing due to inactivity. I have already verified that caching ELMo provides a speedup.

DeNeutoy on 30 Aug 2018
