spaCy: textcat training is not deterministic with GPU enabled

Created on 10 Nov 2020 · 7 comments · Source: explosion/spaCy

How to reproduce the behaviour

This is related to #6177. I can verify that on CPU, the textcat training losses/weights are deterministic with fix_random_seed. However, once I enable the GPU via spacy.require_gpu(), the training losses/weights differ on every run.

import spacy
spacy.require_gpu()

for _ in range(2):
    # reset the seed before each run, so both runs should be identical
    spacy.util.fix_random_seed(0)

    model = spacy.load('en_core_web_sm')

    # add a fresh textcat pipe and drop the parser and tagger
    model.add_pipe(model.create_pipe('textcat'))
    model.remove_pipe('parser')
    model.remove_pipe('tagger')

    cat = model.get_pipe('textcat')
    cat.add_label("dog")
    cat.add_label("donut")

    model.begin_training()
    print(model("What even is?").cats)

Output:

{'dog': 0.2501096725463867, 'donut': 0.3427947163581848}
{'dog': 0.9567031860351562, 'donut': 0.9506585001945496}

Your Environment

  • Operating System: Linux
  • Python Version Used: 3.6.9
  • spaCy Version Used: latest on master (git sha: 320a8b14814c7e0c6dce705ad7bf0f13bf64b61c)
  • Environment Information: Google Colab
Labels: bug, feat / textcat, gpu, training

All 7 comments

Hmm, I can't reproduce this.

Can you double-check by explicitly uninstalling spacy in colab before installing from master? It's possible that the default spacy install isn't being replaced/uninstalled cleanly when you install from source.

What do you see in spacy.git_info.GIT_VERSION?

And what is your thinc version?
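
For reference, something like this prints all three in one go (a minimal sketch, assuming spacy.git_info is available in your install):

import spacy
import spacy.git_info
import thinc

print("spaCy version:", spacy.__version__)
print("git commit:", spacy.git_info.GIT_VERSION)
print("thinc version:", thinc.__version__)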

@adrianeboyd @svlandeg
spacy.__version__: 2.3.2
spacy.git_info.GIT_VERSION: 320a8b148
thinc: 7.4.1

I just wrote up a more detailed script: https://colab.research.google.com/drive/1lVJpVE-SS85jQP3LdkuZkhKvpBA0EuXM?usp=sharing

Hmm, I do think there may be a bug of some sort here in spacy v2. Locally, and with the colab example above, I get consistent results across multiple CPU runs and across multiple GPU runs (also with our quick internal test cases related to this), but the CPU and GPU results don't match each other. And if I extend the training a bit, I do get different results across multiple GPU runs. We will look into it!

In better news, with spacy v3 I get the same results on both (minus some float rounding differences, of course).

I'd be happy to look into this further, but I can't reproduce... :(

After installing a clean copy of spacy[cuda101], I just keep getting consistent results whether I run this on CPU or GPU.
I can run the training loop 200 times and keep getting the same result.

The only thing I can think of right now is that this happens on Linux and not on Windows? Though that makes little sense to me. @adrianeboyd: you couldn't replicate it at first either - what exactly did you change to be able to replicate it?

Here's my test script (just adapted a bit from the one in the colab example):

import spacy
from spacy.util import minibatch, compounding

def train():
    # fixed seed: repeated calls to train() should be fully reproducible
    spacy.util.fix_random_seed(0)
    model = spacy.blank("en")

    model.add_pipe(model.create_pipe("textcat"))

    cat = model.get_pipe("textcat")
    cat.add_label("dog")
    cat.add_label("donut")

    x_train = [f"example {i}" for i in range(1000)]
    y_train = [{"cats": {"dog": i/1000, "donut": 1 - i/1000}} for i in range(1000)]
    train_data = list(zip(x_train, y_train))

    optimizer = model.begin_training()
    for i in range(10):
        batches = minibatch(train_data, size=compounding(16, 64, 1.001))
        losses = {}
        for batch in batches:
            x_batch, y_batch = zip(*batch)
            model.update(x_batch, y_batch, sgd=optimizer, drop=0, losses=losses)
        print(i, "loss:", losses["textcat"])
    print("example 10:", model("example 10").cats)
    print()

if __name__ == "__main__":
    print("1st time CPU:")
    train()
    print("2nd time CPU:")
    train()
    print("\nEnabling GPU\n")
    spacy.require_gpu()
    print("1st time GPU:")
    train()
    print("2nd time GPU:")
    train()

Output:

1st time CPU:
0 loss: 0.020526510332956605
1 loss: 0.2192715626588324
2 loss: 0.1541586974939264
3 loss: 0.21435572720838536
4 loss: 0.1982542650088135
5 loss: 0.19825033005452042
6 loss: 0.19787737677813766
7 loss: 0.016827800470196053
8 loss: 0.02887996903154999
9 loss: 0.02469563187116819
example 10: {'dog': 0.001906172838062048, 'donut': 0.6181842684745789}

2nd time CPU:
0 loss: 0.020526510332956605
1 loss: 0.2192715626588324
2 loss: 0.1541586974939264
3 loss: 0.21435572720838536
4 loss: 0.1982542650088135
5 loss: 0.19825033005452042
6 loss: 0.19787737677813766
7 loss: 0.016827800470196053
8 loss: 0.02887996903154999
9 loss: 0.02469563187116819
example 10: {'dog': 0.001906172838062048, 'donut': 0.6181842684745789}


Enabling GPU

1st time GPU:
0 loss: 0.022869700213050237
1 loss: 0.06781688092814875
2 loss: 0.15603950362856267
3 loss: 0.029185388615587726
4 loss: 0.04577569641696755
5 loss: 0.03271988184133079
6 loss: 0.030841199260066787
7 loss: 0.016764739026257303
8 loss: 0.023379557263069728
9 loss: 0.020565684088069247
example 10: {'dog': 0.15584374964237213, 'donut': 0.9999545812606812}

2nd time GPU:
0 loss: 0.022846033180030645
1 loss: 0.07457155887192357
2 loss: 0.1533858735638205
3 loss: 0.03846120528942265
4 loss: 0.030317590604681754
5 loss: 0.022946339027839713
6 loss: 0.040068494405659294
7 loss: 0.023592466532136314
8 loss: 0.02665060829349386
9 loss: 0.021907005400862545
example 10: {'dog': 0.15843163430690765, 'donut': 0.9288136959075928}

I tested in a new venv with everything from wheels except spacy (from master as of now). "example 10" is the model's cats output for the text "example 10".

example 10 for a few more GPU runs:

{'dog': 0.2435295134782791, 'donut': 0.9999375343322754}
{'dog': 0.4791581332683563, 'donut': 0.9981231093406677}
{'dog': 0.6463608145713806, 'donut': 0.016409972682595253}
{'dog': 0.14756248891353607, 'donut': 0.9230985045433044}

pip freeze: freeze.txt

I redid the test with v3 and the results are a bit more variable than I thought between CPU and GPU, but they're not that different across GPU runs.

CPU: {'dog': 0.0654868334531784, 'donut': 0.9892733693122864}
GPU 1: {'dog': 0.022449197247624397, 'donut': 0.9723042249679565}
GPU 2: {'dog': 0.02237524650990963, 'donut': 0.9726961255073547}
GPU 3: {'dog': 0.022426428273320198, 'donut': 0.9722701907157898}
GPU 4: {'dog': 0.02197781391441822, 'donut': 0.9722147583961487}

Thanks Adriane - the original script didn't include a call to model.update, which is why I couldn't reproduce this.

I was finally able to track this down to the ParametricAttention layer of the CNN model in the default textcat architecture. PR #6411 should fix this - but it requires an update of Thinc to 7.4.3 (to be released).
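
For anyone curious where that layer sits, here's a rough way to print the sublayers of the default v2 textcat model (just a sketch that walks thinc v7's internal _layers attribute, so treat it as exploratory rather than a stable API):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("textcat"))
nlp.get_pipe("textcat").add_label("dog")
nlp.begin_training()  # builds the actual textcat model

def walk(layer, depth=0):
    # print each (sub)layer's class name; the attention layer should show up
    # as long as the wrapper layers expose their children via _layers
    print("  " * depth + layer.__class__.__name__)
    for child in getattr(layer, "_layers", []):
        walk(child, depth + 1)

walk(nlp.get_pipe("textcat").model)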
