I am trying to train a multitask network with multiple binary outcomes. Each datapoint has missing labels for many of the outputs.
The network I am building is based on Ramsundar et al., 2015.
I looked through some related issues:
Because of the missing labels, I do not think the graph model or multiple objective functions will work.
What I actually need is target-based masking during training, which I achieved with custom functions for the loss and metrics:
# toy problem
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

MASK_VALUE = -1

n = 25            # number of datapoints
n_tasks = 19      # number of tasks / binary classes
input_dim = 2048  # input vector size

# generate random X vectors and random Y labels
# (binary labels 0/1, or -1 for missing values)
x = np.random.rand(n, input_dim)
x_test = np.random.rand(5, input_dim)
y = np.random.randint(3, size=(n, n_tasks)) - 1
def build_masked_loss(loss_function, mask_value=MASK_VALUE):
    """Builds a loss function that masks based on targets

    Args:
        loss_function: The loss function to mask
        mask_value: The value to mask in the targets

    Returns:
        function: a loss function that acts like loss_function with masked inputs
    """

    def masked_loss_function(y_true, y_pred):
        mask = K.cast(K.not_equal(y_true, mask_value), K.floatx())
        return loss_function(y_true * mask, y_pred * mask)

    return masked_loss_function
def masked_accuracy(y_true, y_pred):
    total = K.sum(K.not_equal(y_true, MASK_VALUE))
    correct = K.sum(K.equal(y_true, K.round(y_pred)))
    return correct / total
# create model
model = Sequential()
model.add(Dense(1000, activation='relu', input_dim=input_dim))
model.add(Dense(n_tasks, activation='sigmoid'))
model.compile(loss=build_masked_loss(K.binary_crossentropy), optimizer='adam', metrics=[masked_accuracy])
model.fit(x, y)
This trains the network successfully, decreasing the loss and increasing the masked accuracy at each epoch.
However, I am not sure whether the outputs are truly n_tasks independent predictions. I used sigmoid activations, which should be applied elementwise.
Unfortunately, model.predict_classes(x_test) returns one class per datapoint (interpreting n_tasks as the number of classes).
I can use model.predict(x_test).round() instead, which seems to work. However, I actually wanted to use keras.wrappers.scikit_learn.KerasClassifier, which relies on predict_classes.
I am also interested in ranking, for which I use model.predict_proba(x_test). The probabilities do not sum to 1 row-wise, which suggests that they are indeed independent.
Is my workaround correct, can I still use the scikit-learn wrapper, and can I trust the ranking of the probabilities?
EDIT2: I fixed a bug in the masking function
EDIT: I see now that np.allclose(model.predict(x), model.predict_proba(x)) is true. I could use predict_proba using the wrapper and use it to predict probabilities.
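For reference, a minimal sketch of the thresholding workaround (my own illustration, not from the original post): round or threshold the sigmoid outputs per task, instead of relying on predict_classes, which argmaxes over the n_tasks columns.

probs = model.predict(x_test)        # shape (5, n_tasks); independent per-task probabilities
preds = (probs > 0.5).astype(int)    # one binary prediction per task, equivalent to .round()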
If I'm using multiple output layers (with the functional API), how can I apply masking to one entire output?
Let's say there are two tasks, and some labels are missing.
x = Input(shape=(32, 32, 3))
conv = Conv2D(3, (3, 3))(x)
conv = Conv2D(3, (3, 3))(conv)
conv = MaxPooling2D((2, 2))(conv)
conv = Flatten()(conv)
out1 = Dense(10)(conv)
out2 = Dense(20)(conv)
model = Model(inputs=x, outputs=[out1, out2])
For example, in a batch with batch_size=2, the 1st sample has both labels but the 2nd has labels only for out1, so I want to mask out2 for the 2nd sample. How can I do this? It looks like the Masking layer only deals with masking across time steps, not across samples.
@tivaro
Hi
I am trying to run your code but it doesn't work. What is tasks in the line
y = np.random.randint(3, size=(n, tasks))-1
and I also get an error in
File "/home/gilad/tivaroMulti.py", line 41, in masked_accuracy
total = K.sum(K.not_equal(y_true, MASK_VALUE))
File "/home/gilad/keras-master/keras/backend/tensorflow_backend.py", line 459, in sum
return tf.reduce_sum(x, reduction_indices=axis, keep_dims=keepdims)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1100, in reduce_sum
keep_dims, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2678, in _sum
keep_dims=keep_dims, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 573, in apply_op
_Attr(op_def, input_arg.type_attr))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: DataType bool for attr 'T' not in list of allowed values: float32, float64, int64, int32, uint8, uint16, int16, int8, complex64, complex128, qint8, quint8, qint32, float16
@keunwoochoi
Hi. I am also interested in the same problem. I have 7 classification tasks on an image dataset, and some examples have missing labels for some tasks. Please post a solution if you have solved it.
Hey!
@giladdiv
I do not use TensorFlow myself, but it looks like the error arises from summing boolean values. Perhaps you could try to replace
total = K.sum(K.not_equal(y_true, MASK_VALUE))
correct = K.sum(K.equal(y_true, K.round(y_pred)))
with
dtype = K.floatx()
total = K.sum(K.cast(K.not_equal(y_true, MASK_VALUE), dtype))
correct = K.sum(K.cast(K.equal(y_true, K.round(y_pred)), dtype))
or some other dtype ('int32', 'uint64', etc.).
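Putting the two snippets together, the corrected metric as a whole would read (same logic as above, just with the boolean tensors cast before summation):

def masked_accuracy(y_true, y_pred):
    dtype = K.floatx()
    total = K.sum(K.cast(K.not_equal(y_true, MASK_VALUE), dtype))
    correct = K.sum(K.cast(K.equal(y_true, K.round(y_pred)), dtype))
    return correct / total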
@keunwoochoi
In general, I think the best solution to the missing-labels multitask problem is to concatenate and mask your outputs. This is only possible if they all use the same loss function, and if that loss function can be applied pairwise.
For multitask networks with missing labels, it is important that the network only learns from the loss caused by known labels. That is what the masking is for.
By masking the final predictions of the network, you make sure that the loss for those predictions is not 'connected' to the inputs/weights of the network. As a result, the error will not be propagated for those values, and the network will not learn from its errors on the missing values.
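To make that concrete for the two-output example above, a hedged sketch (Keras 2-style API; the sigmoid activations are my assumption, and build_masked_loss is the function from earlier in this thread):

from keras.layers import concatenate

# merge both heads into one output of size 10 + 20 and train with the masked loss
out1 = Dense(10, activation='sigmoid')(conv)
out2 = Dense(20, activation='sigmoid')(conv)
out = concatenate([out1, out2])
model = Model(inputs=x, outputs=out)
model.compile(loss=build_masked_loss(K.binary_crossentropy), optimizer='adam')
# y: an (n_samples, 30) label matrix with MASK_VALUE at the missing positions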
As described in #3206, you could just create a new input for the net that corresponds to the mask. Every time you want to make a prediction, you have to provide the net with a list of [data, mask]-matrices, where mask corresponds to the known (1) or missing (0) labels.
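A hedged sketch of that mask-as-input idea (layer sizes are illustrative, not taken from #3206; Keras 2-style API):

from keras.layers import Input, Dense, Multiply
from keras.models import Model

mask = (y != MASK_VALUE).astype('float32')     # 1 = label known, 0 = missing
data_in = Input(shape=(input_dim,))
mask_in = Input(shape=(n_tasks,))
hidden = Dense(1000, activation='relu')(data_in)
preds = Dense(n_tasks, activation='sigmoid')(hidden)
masked = Multiply()([preds, mask_in])          # zero out predictions for missing labels
model = Model(inputs=[data_in, mask_in], outputs=masked)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# the targets are zeroed at masked positions too, so their loss is exactly 0
model.fit([x, mask], y * mask)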
The advantage of this approach is that it is relatively easy to implement (just use the code provided in this issue).
The downside of this approach is that the loss and accuracy become a bit more difficult to interpret. Even though missing values will not be used in learning, they do still contribute to the loss and accuracy.
Due to the masking, the network will always 'predict' 0 for the missing labels. If the value that you give these labels is 0, then the network will always 'predict' these labels correctly.
So if 80% of your labels are missing, your net will always predict at least 80% correctly. Luckily, these values contribute nothing to the loss (because their loss is zero).
As long as the proportion of missing labels is consistent within each dataset (train/test/valid) across epochs, the values remain comparable across epochs. You would just have to rescale them, or interpret them cautiously.
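As a hedged back-of-the-envelope example (my own illustration): if the masked entries are always counted as correct, the accuracy on the known labels can be recovered from the reported value.

# reported = p_missing * 1.0 + (1 - p_missing) * true_accuracy, hence:
p_missing = 0.8                  # fraction of missing labels (example value)
reported_accuracy = 0.92         # hypothetical value read off the training log
true_accuracy = (reported_accuracy - p_missing) / (1 - p_missing)   # 0.6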
My code above comes down to the same solution, but I try to do the computations internally, avoiding the need to pass the mask manually. However, it requires quite a few modifications, so it may be less straightforward.
@tivaro I just checked your answer, thanks a lot. I understand how the code computes the loss and accuracy. But does your code still backpropagate the error from the missing-label outputs? How is the backpropagation chain blocked?
Thanks @tivaro, this helped me!
@tivaro Thank you for your answer. However, I don't understand how I can evaluate the trained model on test data with missing labels. Do I have to apply the mask during evaluation as well?
@viet2411 yes, the same masking would do the job.
@keunwoochoi What about model.predict()? I want to use the predictions to draw a confusion matrix for each task.
As tivaro mentioned above, np.allclose(model.predict(x), model.predict_proba(x)) is True on the test data, so can I use these predictions to plot the confusion matrices?
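A hedged sketch of what that could look like (my own illustration; y_test is an assumed held-out label matrix that uses MASK_VALUE for missing labels): build each task's confusion matrix from the datapoints whose label for that task is known.

from sklearn.metrics import confusion_matrix

probs = model.predict(x_test)                  # shape (n_samples, n_tasks)
for task in range(n_tasks):
    known = y_test[:, task] != MASK_VALUE      # drop datapoints with a missing label
    cm = confusion_matrix(y_test[known, task], probs[known, task].round())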