I tried, but as output I get a 300-dim vector, whereas I should be getting a single score as the distance between the two tensors:
from scipy.spatial.distance import cityblock

def Manhattan_distance(M):
    A = M[0]
    B = M[1]
    res = cityblock(A, B)
    return res

merged_vector = merge([encoded_a, encoded_b], mode=Manhattan_distance, output_shape=(1,))
Hello,
You have to define your operations using Keras so that they are symbolic; you can't directly reuse functions from scipy.
You should probably create a custom layer instead of using Merge, as it will be easier and less bug-prone to reuse.
Here is one correct way of using Merge to compute the Manhattan distance.
Runnable (but not tested) code:
import numpy as np
from keras.models import Model
from keras.layers import Input
import keras.backend as K
from keras.layers.core import Merge

def Manhattan_distance(A, B):
    # Per-row L1 distance; keepdims gives shape (batch_size, 1)
    return K.sum(K.abs(A - B), axis=1, keepdims=True)

inp1 = Input(shape=(100,))
inp2 = Input(shape=(100,))
merged_vector = Merge(mode=lambda x: Manhattan_distance(x[0], x[1]),
                      output_shape=lambda inp_shp: (inp_shp[0][0], 1))([inp1, inp2])
m = Model([inp1, inp2], [merged_vector])
print(m.predict([np.random.randn(30, 100), np.random.randn(30, 100)]))
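As a sanity check, here is the same batched distance in plain NumPy (an illustration only, not Keras code; the toy inputs are made up to make the expected values easy to verify by hand):

```python
import numpy as np

def manhattan_distance_np(A, B):
    # Batched Manhattan (L1) distance: one scalar per row pair.
    return np.sum(np.abs(A - B), axis=1, keepdims=True)

A = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
print(manhattan_distance_np(A, B))  # [[6.], [3.]]
```

Each row collapses to a single scalar, which is exactly the (batch_size, 1) output shape declared in the Merge call above.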
That's working fine. What about contrastive_loss? Is the following code correct? I tried it, but I do not achieve accuracy with it, nor does the loss decrease.
def contrastive_loss(y, d):
    """Contrastive loss from Hadsell-et-al.'06
    http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
    """
    margin = 1
    return K.mean(y * K.square(d) + (1 - y) * K.square(K.maximum(margin - d, 0)))
Your loss seems right. Maybe you are not using it correctly.
y should be 1 when both inputs should be the same.
y should be 0 when they should differ.
d must be the distance (and not the squared distance). The Caffe implementation uses the L2 distance.
Maybe you can set a different margin, or make sure that the distance is of the same order as 1 (you may need to divide by the number of dimensions). If d > margin for all points, then the y = 0 part of the loss has zero gradient, so the parameters won't move and the loss won't decrease.
Maybe you need a reasonable balance between same and different classes (~50% each).
I'm not familiar with contrastive loss, but I've played with triplet loss, and it is useful to pick your negative ("different") examples carefully: if it's too easy for the network to tell the difference, the gradients are 0 and it doesn't learn from the example.
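To make the zero-gradient point concrete, here is a quick NumPy sketch of just the y = 0 ("different" pair) term of the loss above:

```python
import numpy as np

def negative_pair_term(d, margin=1.0):
    # The (1 - y) * max(margin - d, 0)^2 term of the contrastive loss,
    # evaluated for a dissimilar pair (y = 0).
    return np.maximum(margin - d, 0.0) ** 2

# Inside the margin the loss is positive and pushes pairs apart;
# at or beyond the margin it is exactly 0, so the gradient vanishes.
print(negative_pair_term(0.5))   # 0.25
print(negative_pair_term(1.0))   # 0.0
print(negative_pair_term(50.0))  # 0.0
```

So if your distances all land far beyond the margin, dissimilar pairs contribute nothing to training.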
This is how I am using it:
input_a = Input(shape=(20, input_dim))
input_b = Input(shape=(20, input_dim))
shared_lstm = LSTM(50, dropout_W=0.0, dropout_U=0.0)
encoded_a = shared_lstm(input_a)
encoded_b = shared_lstm(input_b)
merged_vector = Merge(mode=lambda x: Manhattan_distance(x[0], x[1]),
                      output_shape=lambda inp_shp: (inp_shp[0][0], 1))([encoded_a, encoded_b])
model_lstm = Model([input_a, input_b], [merged_vector])
model_lstm.compile(loss=contrastive_loss, optimizer='adam', metrics=['accuracy'])
model_lstm.fit([X_train_sen1, X_train_sen2], y, nb_epoch=1, callbacks=callbacks_list,
               batch_size=60, shuffle=True, verbose=0)
Here y ranges from 0 to 1; this is basically sentence similarity.
Because the LSTM activation is a tanh, which takes values between -1.0 and 1.0, the average per-coordinate distance is ~1.0, and you have 50 cells, so the average total distance is ~50.0 and the gradient for the "different" part of the loss is 0.
You probably want to normalize by the number of dimensions (i.e. take the mean instead of the sum in the Manhattan distance).
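For example, a mean-based Manhattan distance stays on the same order as the margin regardless of the number of cells (a NumPy sketch of the idea; in the Keras version you would just replace K.sum with K.mean):

```python
import numpy as np

def manhattan_distance_mean(A, B):
    # Mean absolute difference: stays O(1) regardless of dimensionality.
    return np.mean(np.abs(A - B), axis=1, keepdims=True)

rng = np.random.RandomState(0)
# Two batches of 50-dim tanh encodings, mimicking the LSTM outputs above
A = np.tanh(rng.randn(4, 50))
B = np.tanh(rng.randn(4, 50))
d = manhattan_distance_mean(A, B)
print(d.shape)  # (4, 1)
# Each coordinate difference lies in [-2, 2], so d is bounded by 2 and
# stays comparable to margin = 1, instead of growing to ~50 with the sum.
```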
Additionally, you may want to regularize your learning to give your network an incentive to learn a meaningful representation; otherwise you will face a "temporal credit assignment problem", which will probably make your training very slow.
Currently your network is probably solving your problem by outputting an almost constant value: it will score well on positive examples (and we have seen previously that the negative examples can't be learned due to the 0 gradient, so they can't help it improve). So the learned representation is not meaningful at all. You have to prevent your network from taking this shortcut.
Yes, training is very slow. In fact, the Pearson correlation between the gold score and my prediction is negative; what does this mean? I have regularized as well. The loss starts from 0.5 and quickly reaches 0.2, and the correlation is still negative, which means the network is not learning anything.
The "accuracy" metric is probably not relevant here (it doesn't compute what you want it to compute: look at the metrics source code).
Try displaying some predictions and investigating. Try displaying the predicted encodings.
I don't know about your correlation between target and prediction, it's probably just noise at this stage.
I'm not using the accuracy metric of Keras; I am actually using the Pearson correlation. My main concern is that the loss reaches approximately 0 very quickly, yet I get no accuracy. Why is the loss decreasing if the network is not learning anything? I am using word vectors as input; should I normalize them?