AllenNLP: Confusing behaviour for models trained with and without the test vocabulary.

Created on 24 Jan 2019 · 9 Comments · Source: allenai/allennlp

I have trained two ESIM entailment models with the exact same config file, except that one includes the test vocabulary and the other does not.

  "train_data_path": "snli_1.0_train.jsonl",
  "validation_data_path": "snli_1.0_dev.jsonl",

vs

  "train_data_path": "snli_1.0_train.jsonl",
  "validation_data_path": "snli_1.0_dev.jsonl",
  "test_data_path": "snli_1.0_test.jsonl",

I understand that their test performance should differ, because one would have UNKs when indexed and the other would not. However, even their validation loss/metrics (and also train loss/metrics) don't match exactly. Whether the test data is included or excluded, the indexed train and validation data should be exactly the same in both cases, right? So I had expected both models to have exactly the same evaluation metrics on the validation set, but they don't.

Random seeds, etc. are the same because the config file is the same. I used ESIM just to verify this behaviour: the model is simple enough and doesn't depend on the size of the vocabulary.

Is this discrepancy expected? I could be missing something obvious here, but I don't see it yet.


HarshTrivedi

All 9 comments

You won't get the same random number draws because you initialize the embedding matrices with different sizes. This means that you can't expect identical performance across the two settings.

matt-gardner on 24 Jan 2019

👍1

Thanks! However, the following check seems to contradict that:

# check1.py
import torch
torch.manual_seed(1)
print(torch.randn(6))

# check2.py
import torch
torch.manual_seed(1)
print(torch.randn(5))

and I get:

# check1.py output:   tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519, -0.1661])
# check2.py output:   tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519])

HarshTrivedi on 24 Jan 2019

Let me load the models and check whether their embedding matrices are in fact exactly the same.

HarshTrivedi on 24 Jan 2019

Oops, I think we use a different function to initialize the matrix, not torch.randn. Let me check that again.

HarshTrivedi on 24 Jan 2019

The issue is the random state _after_ the initialization. You have a different number of draws in those two cases, so if you have, e.g., a dataset shuffle after each one, you'll end up with a different ordering.

matt-gardner on 24 Jan 2019

We use normal_ to initialize. I repeated the check1/check2 comparison with torch.FloatTensor(5, 2).normal_(5, 0.2) vs torch.FloatTensor(6, 2).normal_(5, 0.2), and the results are similar: the first 5 rows are the same. However, we use the mean and standard deviation of the embedding matrix to parameterize this initialization. So in case 1 it will be the mean and std over train-num + val-num tokens, and in case 2 over train-num + val-num + test-num tokens :( So the embedding initialization would indeed be different :(
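To make the mean/std point concrete, here is a minimal sketch using only the Python standard library. It assumes the statistics are computed over the pretrained vectors of whatever tokens are in the vocabulary; the `pretrained` table, its tokens and values, and the `init_stats` helper are all hypothetical and are not AllenNLP code.

```python
import statistics

# Hypothetical toy "pretrained" vectors; tokens and values are made up for illustration.
pretrained = {
    "cat": [0.1, 0.3],
    "dog": [0.2, 0.4],
    "runs": [0.0, 0.5],
    "zebra": [0.9, 1.1],  # in this toy setup, appears only in the test split
}

def init_stats(vocab):
    """Mean and population std over all values of the pretrained vectors found for `vocab`."""
    values = [v for token in vocab if token in pretrained for v in pretrained[token]]
    return statistics.fmean(values), statistics.pstdev(values)

train_val_vocab = ["cat", "dog", "runs"]
with_test_vocab = train_val_vocab + ["zebra"]

# The two vocabularies yield different (mean, std) pairs, so a normal
# initialization parameterized by them would differ as well.
print(init_stats(train_val_vocab))
print(init_stats(with_test_vocab))
```

Under that assumption, adding even one test-only token shifts the parameters of the normal distribution, so the initial values differ even for rows that both vocabularies share.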

@matt-gardner, I believe what you are saying is independent of this, but I don't understand your last comment. Can you please explain more? Do you mean that if dataset shuffling is off, the random state problem won't happen?

HarshTrivedi on 24 Jan 2019

Training performance depends on the order that you see your training examples. The order that you see your training examples depends on the random state at the time of the training data shuffle. This random state depends on the size of the matrices that get initialized using the random state. The sizes are different in your two cases, which results in different random states after initialization, affecting _everything that happens afterward_.
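The chain described above can be sketched with Python's standard-library generator as a stand-in for torch's (an assumption made only to keep the example dependency-free). The helper `shuffled_order_after_init` and its parameters are hypothetical: it seeds a generator, consumes a given number of draws to mimic initializing a matrix of that size, and then shuffles the example indices.

```python
import random

def shuffled_order_after_init(num_init_draws, num_examples=100, seed=1):
    """Seed, consume `num_init_draws` values (the "initialization"), then shuffle."""
    rng = random.Random(seed)
    # Stand-in for filling an embedding matrix: one draw per parameter.
    init_values = [rng.random() for _ in range(num_init_draws)]
    order = list(range(num_examples))
    # The shuffle starts from whatever state the initialization left behind.
    rng.shuffle(order)
    return init_values, order

init_small, order_small = shuffled_order_after_init(5)
init_large, order_large = shuffled_order_after_init(6)

# The draws shared by both runs agree, mirroring check1/check2 ...
print(init_large[:5] == init_small)
# ... but the extra draw leaves the generator in a different state,
# so the subsequent shuffles can be compared for equality here.
print(order_small == order_large)
```

The first comparison shows why the two embedding matrices can look alike on their shared rows; the second shows whether the example order that follows is still the same once one run has consumed an extra draw.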

matt-gardner on 24 Jan 2019

👍1

Okay, I understand. I guess both of these reasons apply. Thank you for the help!!

HarshTrivedi on 24 Jan 2019

I reran check1 and check2, printing the numbers twice, and the second draws are no longer the same:

tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519, -0.1661])  # same
tensor([-1.5228,  0.3817, -1.0276, -0.5631, -0.8923, -0.0583])  # diff

tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519])  # same
tensor([-0.1661, -1.5228,  0.3817, -1.0276, -0.5631])   # diff

Thanks for the help!

HarshTrivedi on 24 Jan 2019

👍1
