First of all, thanks for your nice and clean implementation. I want to port the decomposable attention example to Keras v2, but I have a problem. I asked the question on SO but didn't get any answer; could you please help me fix this issue?
SO thread:
https://stackoverflow.com/questions/46843131/use-merge-layer-lambda-function-on-keras-2-0
I haven't actually run this code, but I think something like the following should work:
~~~
def __call__(self, sent1, sent2):
    from keras import backend as K
    from keras.layers import Lambda

    def _outer(AB):
        # Attention logits: batch_dot the projected sent2 against the transposed
        # projected sent1, then transpose back so axis 1 indexes sent1 tokens.
        att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
        return K.permute_dimensions(att_ji, (0, 2, 1))

    return Lambda(_outer, output_shape=(self.max_length, self.max_length))(
        [self.model(sent1), self.model(sent2)])
~~~
I think the _outer function probably needs to be moved out to module scope as well --- otherwise the Lambda layer might have a problem deserialising the model?
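For what it's worth, here's a rough sketch of what I mean --- not tested, and the filename is just an example. With `_outer` at module scope, you can hand it to `load_model` via `custom_objects`:
~~~
from keras import backend as K
from keras.layers import Lambda
from keras.models import load_model

def _outer(AB):
    # Raw attention matrix between the two projected sentences.
    att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
    return K.permute_dimensions(att_ji, (0, 2, 1))

# ... build and save the model, using Lambda(_outer, ...) as above ...

# Because _outer is importable, it can be resolved when reloading:
model = load_model('parikh_entailment.h5', custom_objects={'_outer': _outer})
~~~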
I hope you can submit a pull request with your fixes once you get it working. You'll probably want to have a look at the other issues about that example. I'm still not 100% sure the code implements Parikh's model correctly --- I think the way the attention averages over the padded sentences might be wrong? There's good discussion of this in other issues.
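To make that concern concrete: a common way to handle this (just a sketch with made-up names, not what the example currently does) is to mask the padded positions out of the softmax, so they get no weight:
~~~
from keras import backend as K

def masked_softmax(att, mask):
    # att:  (batch, len1, len2) raw attention logits
    # mask: (batch, len2), 1.0 for real tokens, 0.0 for padding
    mask = K.expand_dims(mask, axis=1)                      # (batch, 1, len2)
    e = K.exp(att - K.max(att, axis=-1, keepdims=True))     # numerically stable exp
    e = e * mask                                            # drop padded columns
    return e / (K.sum(e, axis=-1, keepdims=True) + K.epsilon())
~~~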
I had a go at this, and saved my work on a branch here: https://github.com/explosion/spaCy/tree/example/keras-parikh-entailment
This is failing though. I can't be sure, but I think the problem is in the masking stuff? The masking was always a problem, and now I don't see how it's even supposed to work. I tried to find an answer on the issue tracker: https://github.com/fchollet/keras/issues?utf8=%E2%9C%93&q=masking . However Keras auto-closes issues after 30 days, so people just open new ones --- so there are almost 300 issues about masking. This makes it tough to understand the current recommendations :(
@honnibal On https://github.com/explosion/spaCy/tree/example/keras-parikh-entailment one mistake I can spot is that the Entailment module expects nr_hidden * 2 units at its input, but is actually being fed nr_hidden * 4: Concat[AvgPool sent1, MaxPool sent1, AvgPool sent2, MaxPool sent2].
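Roughly, the shapes work out like this (just a sketch --- the function and variable names are made up, not the example's actual code):
~~~
from keras.layers import concatenate, Dense

def entailment(avg_sent1, max_sent1, avg_sent2, max_sent2, nr_hidden, nr_class):
    # Each pooled tensor is (batch, nr_hidden), so concatenating four of them
    # gives (batch, nr_hidden * 4) --- the width the first Dense layer must accept.
    pooled = concatenate([avg_sent1, max_sent1, avg_sent2, max_sent2])
    hidden = Dense(nr_hidden, activation='relu')(pooled)
    return Dense(nr_class, activation='softmax')(hidden)
~~~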
Though fixing this will make the model compile, masking will indeed be required to deal with variable-length sequences in the same batch. If I understand correctly, careful masking is needed both in the Align module (right before computing the softmax) and in the Compare module (when taking the pools).
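For the Compare pools, something along these lines should work (again only a sketch with assumed names): use the mask so padded timesteps don't contribute to the average and can't win the max.
~~~
from keras import backend as K

def masked_pools(compared, mask):
    # compared: (batch, max_length, nr_hidden) output of the Compare module
    # mask:     (batch, max_length), 1.0 for real tokens, 0.0 for padding
    mask = K.expand_dims(mask, axis=-1)                     # (batch, max_length, 1)
    avg = K.sum(compared * mask, axis=1) / (K.sum(mask, axis=1) + K.epsilon())
    mx = K.max(compared + (1.0 - mask) * -1e9, axis=1)      # padded rows pushed to -inf
    return avg, mx
~~~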
I might give it a try using the Lambda layer approach at https://gist.github.com/braingineer/b64ca35223c7782667984d34ddb7a7fa, assuming the zero embedding vector is reserved for padding. I'll submit a PR if I can make it work.
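The mask itself can be built from the padded word-id input with a small Lambda, under that same assumption that id 0 is reserved for padding (the names here are just for illustration):
~~~
from keras import backend as K
from keras.layers import Lambda

def _make_mask(ids):
    # 1.0 where there is a real token, 0.0 wherever the padding id 0 appears
    return K.cast(K.not_equal(ids, 0), 'float32')

mask_layer = Lambda(_make_mask, output_shape=lambda shape: shape)
# mask = mask_layer(word_ids)   # word_ids: (batch, max_length) integer input
~~~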
@honnibal I really liked the method proposed by Parikh et al. and wanted an implementation in Keras. But like many, I struggled to get an implementation that performed on par with theirs. I had the same hunch about the masking issues and decided to go ahead and write a few layers that take care of this. It worked! You can find the implementation in the following Git repo: https://github.com/martbert/decomp_attn_keras. The implementation can certainly be improved, but it's still a good start.
Merging this with #2758.