AllenNLP: Confusing behaviour for models with and without including test vocabulary.

Created on 24 Jan 2019 · 9 comments · Source: allenai/allennlp

I have trained two ESIM entailment models with exactly the same config file, except that one includes the test vocabulary and the other does not.

  "train_data_path": "snli_1.0_train.jsonl",
  "validation_data_path": "snli_1.0_dev.jsonl",

vs

  "train_data_path": "snli_1.0_train.jsonl",
  "validation_data_path": "snli_1.0_dev.jsonl",
  "test_data_path": "snli_1.0_test.jsonl",

I understand that their test performance should differ, because once indexed, one would have UNKs and the other wouldn't. However, even their validation loss/metrics (and also train loss/metrics) don't match exactly. Whether the test set is included or excluded, the indexed train and validation data should be exactly the same in both cases, right? So I had expected both models to have exactly the same evaluation metrics on the validation set, but they don't.

Random seeds, etc. are the same because the config file is the same. I used ESIM just to verify this behaviour: the model is simple enough and doesn't depend on the size of the vocabulary.

Is this discrepancy expected? I could be missing something obvious here, but don't see it yet.

All 9 comments

You won't get the same random number draws because you initialize the embedding matrices with different sizes. This means that you can't expect identical performance across the two settings.

Thanks! However, the following check seems to contradict that:

# check1.py
import torch
torch.manual_seed(1)
print(torch.randn(6))
# check2.py
import torch
torch.manual_seed(1)
print(torch.randn(5))

and I get:

# check1.py output:   tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519, -0.1661])
# check2.py output:   tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519])

Let me load the models and check whether their embedding matrices are in fact exactly the same.

Oops, I think we use a different function to initialize the matrix, not torch.randn. Let me check that again.

The issue is the random state _after_ the initialization. You have a different number of draws in those two cases, so if you have, e.g., a dataset shuffle after each one, you'll end up with a different ordering.

We use normal_ to initialize. I tried the check1 and check2 scripts above with torch.FloatTensor(5, 2).normal_(5, 0.2) vs torch.FloatTensor(6, 2).normal_(5, 0.2), and the results are similar: the first 5 rows are the same. However, we use the mean and SD of the embedding matrix for this initialization. So in case 1 it will be the mean and std of train-num + val-num, and in case 2, of train-num + val-num + test-num : ( So the embedding initialization would indeed be different : (
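The mean/std point above can be illustrated with a small sketch. It only follows the description in this comment and is not AllenNLP's actual code: the `pretrained` dictionary, its toy values, and the helper `init_stats` are hypothetical, and plain Python lists stand in for tensors.

```python
from statistics import mean, pstdev

# Hypothetical toy "pretrained" vectors; real embeddings would be loaded from a file.
pretrained = {
    "cat": [0.1, 0.3],
    "dog": [0.2, 0.4],
    "zebra": [0.9, 1.1],  # imagine this word occurs only in the test set
}

def init_stats(vocab):
    """Mean and population std over all values of the pretrained vectors found for `vocab`."""
    values = [v for word in vocab if word in pretrained for v in pretrained[word]]
    return mean(values), pstdev(values)

# Case 1: vocabulary built from train + val only.
stats_without_test = init_stats(["cat", "dog"])
# Case 2: vocabulary built from train + val + test.
stats_with_test = init_stats(["cat", "dog", "zebra"])

print(stats_without_test)
print(stats_with_test)
```

If the initializer samples from a normal distribution parameterized by these statistics, the two cases draw from different distributions, so even rows shared by both vocabularies would get different initial values.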

@matt-gardner, I believe what you are saying is independent of this. But I don't understand your last comment. Can you please explain more? Do you mean that if dataset shuffling is off, the random state problem won't happen?

Training performance depends on the order that you see your training examples. The order that you see your training examples depends on the random state at the time of the training data shuffle. This random state depends on the size of the matrices that get initialized using the random state. The sizes are different in your two cases, which results in different random states after initialization, affecting _everything that happens afterward_.
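A minimal sketch of this chain of dependencies, using Python's standard `random` module as a stand-in for the torch generator (an assumption made to keep the example dependency-free). The function name `shuffled_order_after_init` and its parameters are hypothetical.

```python
import random

def shuffled_order_after_init(num_init_draws, num_examples=20, seed=1):
    """Seed a generator, consume `num_init_draws` values (standing in for
    weight initialization), then shuffle example indices with the same generator."""
    rng = random.Random(seed)
    # Stand-in for initializing an embedding matrix of a given size.
    _ = [rng.random() for _ in range(num_init_draws)]
    order = list(range(num_examples))
    rng.shuffle(order)  # the shuffle sees whatever state the draws left behind
    return order

order_small_vocab = shuffled_order_after_init(num_init_draws=5)
order_large_vocab = shuffled_order_after_init(num_init_draws=6)
order_small_again = shuffled_order_after_init(num_init_draws=5)

print(order_small_vocab)
print(order_large_vocab)
```

Repeating the run with the same number of draws reproduces the same order, while a single extra draw before the shuffle changes it, even though the seed is identical.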

Okay, I understand. I guess both of these reasons apply. Thank you for the help!!

I ran check1 and check2 printing the numbers twice, and the second draws are no longer the same:

tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519, -0.1661])  # same
tensor([-1.5228,  0.3817, -1.0276, -0.5631, -0.8923, -0.0583])  # diff
tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519])  # same
tensor([-0.1661, -1.5228,  0.3817, -1.0276, -0.5631])   # diff
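One possible way to decouple the two effects, sketched here as an assumption rather than anything the thread says AllenNLP does, is to give initialization and shuffling their own generators. The example uses Python's standard `random` module; in PyTorch the analogous tool would be separate `torch.Generator` objects. The function `run` and its parameters are hypothetical.

```python
import random

def run(num_init_draws, seed=1, num_examples=20):
    """Use one generator for weight initialization and another for data ordering."""
    init_rng = random.Random(seed)         # consumed only by initialization
    shuffle_rng = random.Random(seed + 1)  # consumed only by shuffling
    weights = [init_rng.random() for _ in range(num_init_draws)]
    order = list(range(num_examples))
    shuffle_rng.shuffle(order)
    return weights, order

weights_small, order_small = run(num_init_draws=5)
weights_large, order_large = run(num_init_draws=6)

# The example order no longer depends on how many values initialization consumed.
print(order_small == order_large)
# The shared prefix of the initial values also matches, as in the check1/check2 outputs.
print(weights_small == weights_large[:5])
```

This only removes the dependence on the number of draws. It would not make the two models identical if the initialization statistics themselves differ between the two vocabularies.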

Thanks for the help!
