Allennlp: Potential bug: The maxpool in cnn_encoder can be triggered by pad tokens.

Created on 20 Oct 2020 · 6Comments · Source: allenai/allennlp

Description

When using a text_field_embedder -> cnn_encoder (without seq2seq_encoder), the output of the embedder (and mask) get fed directly into the cnn_encoder. The pad tokens will get masked (set to 0), but it's still possible that after applying the mask followed by the CNN, the PAD tokens are those with highest activations. This could lead to the same exact datapoint getting different predictions if's part of a batch vs single prediction.

Related issues or possible duplicates

None

Environment

OS: NA

Python version: NA

Steps to reproduce

This can be reproduced by replacing
https://github.com/allenai/allennlp/blob/00bb6c59b3ac8fdc78dfe8d5b9b645ce8ed085c0/allennlp/modules/seq2vec_encoders/cnn_encoder.py#L113

filter_outputs.append(self._activation(convolution_layer(tokens)).max(dim=2)[0])

with

activated_outputs, max_indices = self._activation(convolution_layer(tokens)).max(dim=2)

and checking the indices for the same example inside of a batch vs unpadded.

Possible solution:

We could resolve this by adding a large negative value to all CNN outputs for masked tokens, similarly to what they do in the transformers library (https://github.com/huggingface/transformers/issues/542, https://github.com/huggingface/transformers/blob/c912ba5f69a47396244c64deada5c2b8a258e2b8/src/transformers/modeling_bert.py#L262), but I have not been able to figure out how to do this efficiently.

bug

Source

MichalMalyska

👀1

All 6 comments

I created a google collab notebook that showcases how the problem arises and potential solution:
https://colab.research.google.com/drive/1i73ZCEdPGRjS_hKNlrqHBklITME2hGKL?usp=sharing

The solution I created creates an additive mask with large negative values for all filter activations that involved a pad token anywhere. It is a bit clunky so I would appreciate any ideas on how to make this more readable / efficient before creating a PR to fix the bug.

MichalMalyska on 21 Oct 2020

❤1

Thanks for the detailed report @MichalMalyska. I'll take a look at this soon.

epwalsh on 21 Oct 2020

@MichalMalyska It seems like you have a reasonable solution, would you mind opening up a PR with your fix?

By the way, you can utilize allennlp.nn.util.min_value_of_dtype() to get an appropriate large negative number for the masked values instead of always using 10e-5.

epwalsh on 21 Oct 2020

Yeah, I'll make a PR + some tests on the new seq2vec to ensure the batched and non-batched predictions have the same results. Similar tests are probably in order for other seq2vec encoders that don't use the pack_padded_sequence and pad_packed_sequence methods, and for potentially seq2seq just to ensure masking is actually working as intended.

There is one design decision that needs to be made:
Since the approach requires a mask to be not none, we can either:

create a new one of all ones at the start if the original one is None
Wrap all of this in an if statement that runs the old version of code if mask is None and the new one otherwise.

I am partial to just creating a mask tensor of all True, since it seems much more elegant and doesn't increase the code complexity but it adds memory overhead.

MichalMalyska on 21 Oct 2020

If creating a mask of all ones significantly reduces the complexity, then I'm for it. It should be rare that mask is actually None, so I'm not too worried about the performance penalty.

epwalsh on 21 Oct 2020

👍1

@epwalsh this is just a friendly ping to make sure you haven't forgotten about this issue 😜