Keras: Writing a metric is not easy.

Created on 25 Nov 2016 · 30 comments · Source: keras-team/keras

Recently a friend asked me to write a MAP (mean average precision) metric. The metric requires a sorting operation.

At first I planned to use K.get_value() to read the values of y_pred and y_true, and then use the sorting utilities provided by numpy or plain Python to compute the metric, like this:

import numpy as np
from keras import backend as K

def MAP(y_true, y_pred):
    # This does not work: y_true and y_pred are symbolic tensors in the graph,
    # so K.get_value() cannot return concrete values here.
    np_y_true = K.get_value(y_true)
    np_y_pred = K.get_value(y_pred)

    # Rank samples by predicted score, highest first.
    zipped = sorted(zip(np_y_true, np_y_pred), key=lambda x: x[1], reverse=True)
    np_y_true, np_y_pred = zip(*zipped)

    # Indices of the positive samples in the ranked list.
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i]) == 1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k + 1])  # positives among the top k+1 predictions
        score += Yk / float(k + 1)
    score /= r

    return K.variable(score)

However, this function doesn't work, because a metric is part of the computation graph, which means it must be a pure "tensor operation". I cannot get the values of y_true and y_pred because they are still "empty" at this point; we can only obtain the values of these tensors once real data is fed into the input tensors of the computation graph.

I think this is a poor fit for metric functions, because a metric function is not really part of the model; it is only there for evaluation, not for producing errors or gradients or anything else that matters to the training process.

There are many different metrics, and it is not easy for users to define them in pure tensor language. Perhaps we should rethink the implementation of metrics: decouple them from the computation graph and make it possible to use numpy/Python for such tasks.

What do you think? @fchollet

BTW, in the end I wrote a callback and called model.predict inside it to compute MAP. It works well, but it only works for a fixed validation set and cannot compute the metric on the training data. If more information were provided to Callback, it could be a viable solution. Here is the code:

import numpy as np
from keras.callbacks import Callback

class MAP_eval(Callback):
    def __init__(self, validation_data):
        self.validation_data = validation_data
        self.maps = []

    def eval_map(self):
        x_val, y_true = self.validation_data
        y_pred = self.model.predict(x_val)
        y_pred = list(np.squeeze(y_pred))

        # Rank samples by predicted score, highest first.
        zipped = sorted(zip(y_true, y_pred), key=lambda x: x[1], reverse=True)
        y_true, y_pred = zip(*zipped)

        k_list = [i for i in range(len(y_true)) if int(y_true[i]) == 1]
        score = 0.
        r = np.sum(y_true).astype(np.int64)
        for k in k_list:
            Yk = np.sum(y_true[:k + 1])
            score += Yk / float(k + 1)
        score /= r
        return score

    def on_epoch_end(self, epoch, logs={}):
        score = self.eval_map()
        print("MAP for epoch %d is %f" % (epoch, score))
        self.maps.append(score)
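
A usage sketch (x_train, y_train, x_val, y_val, and model are placeholders for your own data and compiled model):

```python
map_callback = MAP_eval(validation_data=(x_val, y_val))
model.fit(x_train, y_train, nb_epoch=10, batch_size=128,
          callbacks=[map_callback])
```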

Most helpful comment

Maybe we could add something like model.add_metric(tensor) like we do with losses. Thoughts?

All 30 comments

I wanted to do a similar thing, and I decided to explicitly pass the training and validation sets to my metrics callback. I don't like it, but it works for me.

MetricsCallback.py

class MetricsCallback(Callback):
    def __init__(self, metrics, validation_data, test_data):
        super().__init__()
        self.validation_data = validation_data
        self.test_data = test_data
        self.metrics = metrics
        ...

and then do

metrics_callback = MetricsCallback(validation_data=(X_val, y_val),
                                   test_data=(X_test, y_test),
                                   metrics=["macro_f1", "macro_recall"])

nn_model.fit(X_train, y_train, validation_data=(X_val, y_val),
             nb_epoch=150, batch_size=128,
             callbacks=[metrics_callback])

@cbaziotis
Yes, it does work. But you can't access the training data after each batch or epoch, can you? If this information were provided in callbacks, we could rewrite the metrics module and implement the metrics as Callbacks. I think that's a possible solution.

Anyway, what I want to say is that metrics should not be part of the computation graph, so the current implementation is inappropriate.

@MoyanZitto

Yes, it does work. But you can't access the training data after each batch or epoch, can you?

Of course you can. I showed you how above: you pass them to your Callback. Here is an example where you explicitly pass your train_data and validation_data:

MetricsCallback

class MetricsCallback(Callback):
    def __init__(self, train_data, validation_data):
        super().__init__()
        self.validation_data = validation_data
        self.train_data = train_data

    def on_epoch_end(self, epoch, logs={}):

        X_train = self.train_data[0]
        y_train = self.train_data[1]

        X_val = self.validation_data[0]
        y_val = self.validation_data[1]

        # do whatever you want next


**Model**

```python
metrics_callback = MetricsCallback(train_data=(X_train, y_train),
                                   validation_data=(X_val, y_val))

nn_model.fit(X_train, y_train, validation_data=(X_val, y_val),
             nb_epoch=150, batch_size=128,
             callbacks=[metrics_callback])
```

But I agree that this is not so nice. I did it like that because I want to use my own metrics, and @fchollet said in a comment (https://github.com/fchollet/keras/issues/3230#issuecomment-233022003) that you should use a callback for that.

What I don't understand is why we don't use scikit-learn's metrics in the first place. What I would really like is a way to pass a scorer created with make_scorer to the metrics argument of compile(), like this:

from sklearn.metrics import make_scorer, f1_score

scorer = make_scorer(f1_score, labels=['positive', 'negative'], average='macro')

model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=[scorer])

@cbaziotis
Well, if you are using fit_generator or validation_split, it doesn't work.
What I mean is that the callback should be able to access the training samples of each forward pass, rather than having them passed in by hand. To do this, the code in _fit_loop and the Callback class would have to be modified appropriately, and the metrics part of .compile should be removed.

If @fchollet agrees to adjust the structure, I can help write some PRs.

Using scikit-learn metrics is a good idea; there are also many other modules that could be used in Keras, K-fold cross-validation for example. Although it would add another dependency to Keras, I think it is worth it. However, @fchollet is the boss, so it's up to him.

@MoyanZitto
I'm confused about the MAP metric you propose. Your implementation looks like the formula for AP@all, so isn't it AP (average precision) rather than MAP (mean average precision)?

On the other hand, I'm confused about why the result does not match sklearn.metrics.average_precision_score:

import numpy as np
from sklearn.metrics import average_precision_score
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

def MAP(np_y_true, np_y_pred):

    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x:x[1],reverse=True)

    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i])==1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk/(k+1)

    score/=r
    return score

print MAP(y_true, y_pred)
#---> 0.5
print average_precision_score(y_true, y_pred)
#---> 0.791666666667

@lebavarois
This code was written based on the algorithm given by a friend; the documentation said its name is "MAP".

But yes, it does look more like AP@all. Here is the algorithm:

  1. Rank the predicted probabilities from high to low.
  2. If the number of true positive samples among the top k predictions is Y_k, define P@k as:
    P@k = Y_k / k
  3. If the ranks of the positive samples are k1, k2, ..., kr, where r is the total number of positive samples, then MAP is defined as:
    MAP = (P@k1 + P@k2 + ... + P@kr) / r

I hope I didn't get the code wrong...

@MoyanZitto
I found out why the values did not match in my previous post.

1) I got the wrong value in Python 2; in Python 3 it is correct.
score += Yk/(k+1) should be score += Yk/float(k+1) in Python 2.

2) There also seems to be a problem in the latest sklearn version. With the code from this pull request, I get the same value as your code.

import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

def AP(np_y_true, np_y_pred):

    # Rank samples by predicted score, highest first.
    zipped = sorted(zip(np_y_true, np_y_pred), key=lambda x: x[1], reverse=True)
    np_y_true, np_y_pred = zip(*zipped)

    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i]) == 1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k + 1])
        score += Yk / float(k + 1)

    score /= r
    return score


print(AP(y_true, y_pred))
#---> 0.8333
print(average_precision_score(y_true, y_pred))
#---> 0.8333

Still I think it is AP and not MAP. If you have a multi-label classification (e.g. one image with multiple class-labels), AP should give you the evaluation for just one test datapoint (e.g. one test image). If you then have multiple test datapoints (e.g. multiple images) you can compute the mean for the whole test set, which is then MAP. So in this case y_true and y_pred should be 2 dimensional (multiple outputs) and the AP function should be applied to the second dimension. Finally the mean is taken over the first dimension (which corresponds to the datapoints, e.g. multiple images).

Maybe for other use-cases it makes more sense to implement it the way you did (if the class-label is part of the model-input and you have just one model-output, e.g. recommendation system). What kind of data are you evaluating?
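
A minimal sketch of that per-sample mean, reusing the AP function above and assuming 2-D arrays of shape (n_samples, n_labels):

```python
import numpy as np

def mean_average_precision(y_true_2d, y_pred_2d):
    # Apply AP row by row (one row per test datapoint), then average the scores.
    scores = [AP(t, p) for t, p in zip(y_true_2d, y_pred_2d)]
    return np.mean(scores)
```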

@lebavarois
Well, this is the metric given in the documentation of a data mining competition. It is a binary classification problem here.

I really appreciate your response; it's very clear and helpful. Thank you very much for the explanation.

Perhaps we've drifted a little off topic; let's come back to the main subject. My point is that although we can implement our own metrics by defining a callback, we shouldn't have to. Here are the reasons:

  • Metrics exist to evaluate the performance of the model, not to produce gradients, so they don't need to be part of the computation graph; that's why we can remove them from the compile process.

  • Removing metrics from the computation graph would make it easier for users to define metrics with more complicated logic, or to use functions provided by other packages such as scikit-learn; that's why we want to do this.

Please feel free to correct me if I'm wrong.

@MoyanZitto
I think the evaluation might be faster if it is part of the computation graph. For the few measures that cannot be computed with the Keras backend, there is still the workaround you suggested using callbacks.

For the sorting you need in MAP, it would be helpful to have a sorting function in the Keras backend. I think it could be accomplished using theano.tensor.sort for Theano and tf.nn.top_k (which also supports sorting) for TensorFlow.
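
For illustration, a rough sketch of what such a graph-based AP could look like with TensorFlow ops (tf_average_precision is just an illustrative name, not an existing Keras or TF function, and it assumes a 1-D binary problem):

```python
import tensorflow as tf

def tf_average_precision(y_true, y_pred):
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])

    n = tf.shape(y_pred)[0]
    # tf.nn.top_k with k == n returns the predictions fully sorted (descending).
    _, indices = tf.nn.top_k(y_pred, k=n, sorted=True)
    true_sorted = tf.gather(y_true, indices)

    # Precision@k at every rank, averaged over the positive positions.
    ranks = tf.cast(tf.range(1, n + 1), tf.float32)
    precisions = tf.cumsum(true_sorted) / ranks
    return tf.reduce_sum(precisions * true_sorted) / tf.reduce_sum(true_sorted)
```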

@lebavarois
I see, yes, it would be faster. I know theano.tensor.sort; it's... hard to understand and even harder to verify that you're using it correctly.

Anyway, speed is a reasonable argument, although compared with the training cost I don't think the evaluation step is very time-consuming. In my opinion feasibility matters more than efficiency, and besides, raw computation speed has never been Keras's main selling point.

Even if, for the sake of evaluation efficiency, we want to keep the current implementation, I still think we should at least add another callback class (one that simply uses scikit-learn metrics) so that users have a choice. If a metric is very complicated and training time is not a concern, they could use the callback version.

Bringing this up again now that some of the metrics were removed from the latest release.
It would be really great if we could come up with a good callback class that enables evaluation outside of the computation graph. Has anyone already written something like this?

I'm trying to implement a metric function for the F-score used in the BIO tagging scheme. Really not so easy... As @MoyanZitto said, decoupling the metric from the computation graph would be a good thing, since F-scores for various scenarios have already been implemented in other packages using numpy.

@cbaziotis
When I tried your code to calculate the training loss, I found it differs from the loss reported by Keras:

x_train = self.train_data[0]
y_train = self.train_data[1]

x_val = self.validation_data[0]
y_val = self.validation_data[1]

y_train_pre = self.model.predict(x_train)
kvar_pred = K.variable(y_train_pre)
kvar_true = K.variable(y_train)
my_loss = K.mean(K.categorical_crossentropy(kvar_pred, kvar_true))
print("\nmy train loss", K.eval(my_loss))

I am really confused. The loss reported by Keras is about 0.02, but when I print my_loss it is about 1.00. Do you have any idea what the problem is?

I'm okay with writing metrics as callbacks, as long as I can see them properly in the progress bar after an epoch is finished. I know EarlyStopping and ModelCheckpoint already work if we append extra metrics to the logs dict, but ProgbarLogger just skips the extra metrics (https://github.com/fchollet/keras/blob/master/keras/callbacks.py#L291).

If there's a new float in the logs dict when it reaches on_epoch_end, why not just show it? It could also raise a warning if a value is found that is not a float or int.
Callback printing is already a little off, so if I want to show a custom metric while training I have to either set verbose=2 or print some newlines before and after the custom metric output. Printing it with ProgbarLogger would make things much easier.

Since we're talking about this, and I'm not sure it was discussed elsewhere: I constantly end up having to write custom callbacks for metrics and spending extra computation time re-computing the model's output just because metrics can't operate on multiple outputs. The way we have it now, a custom metric takes a single prediction and its expected output counterpart. Say I have non mutually exclusive classification; how can I compute the fraction of "completely correct" samples using metric functions? I can't.

@fredtcaroli you can just append the metric name to self.params['metrics'] and it will show in the progbar: self.params['metrics'].append('my_custom_metric')
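
A minimal sketch of that trick, assuming a Keras version where ProgbarLogger reads self.params['metrics'] as described above (the metric name and the value computation are placeholders):

```python
from keras.callbacks import Callback

class CustomMetricLogger(Callback):
    def on_train_begin(self, logs=None):
        # Register the name so ProgbarLogger will display it.
        self.params['metrics'].append('my_custom_metric')

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Replace with your own computation (e.g. a numpy/sklearn metric).
        logs['my_custom_metric'] = 0.0
```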

I would also like to be able to implement that kind of custom metric at the expense of a bit of performance. I don't see how the workaround by @cbaziotis solves the issue, because the custom callback never receives y_pred; it only keeps references to the network's input data and expected outputs.

Maybe we could add something like model.add_metric(tensor) like we do with losses. Thoughts?
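
For what it's worth, tf.keras later gained an API along these lines. A minimal sketch, assuming a tf.keras 2.x release where Model.add_metric is available (the aggregation argument was required in some versions and deprecated in later ones):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(10,))
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(inputs)
model = tf.keras.Model(inputs, outputs)

# Track an arbitrary symbolic tensor as a named metric, analogous to add_loss.
model.add_metric(tf.reduce_mean(outputs), name='mean_prediction', aggregation='mean')

model.compile(optimizer='adam', loss='binary_crossentropy')
```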

Wasn't the whole point of the issue about detaching custom metrics from the graph, and hence being able to use y_pred and y_true as numpy arrays, for the freedom of and compatibility with the numpy ecosystem?

If you want to use numpy metrics, you don't really need Keras-level support. You can just have your training process dump saved models, then have a different eval process load them, call predict on your validation data, and call a numpy metric on the output. Or any other similar workflow.
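
A minimal sketch of that workflow (the file names, data arrays, and metric here are placeholders):

```python
# train.py: periodically dump models during training.
from keras.callbacks import ModelCheckpoint

model.fit(x_train, y_train,
          callbacks=[ModelCheckpoint('model_epoch_{epoch:02d}.h5')])

# eval.py: a separate process loads a checkpoint and scores it with numpy/sklearn.
from keras.models import load_model
from sklearn.metrics import average_precision_score

saved = load_model('model_epoch_05.h5')
y_pred = saved.predict(x_val)
print(average_precision_score(y_val, y_pred))
```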

On 14 December 2017 at 14:40, Zvezdin notifications@github.com wrote:

Wasn't the whole point of the issue about detaching the custom metrics
from the graph and hence be able to use y_pred and y_true as numpy arrays
to have the freedom of and compatibility to the numpy ecosystem?


@zvezdin callbacks are perfectly fine for dealing with that. The only thing missing is having the last prediction available to the callback, and that wouldn't be easy (or even possible) to implement without a performance hit. That's because the internal training function never actually outputs the predicted vector; copying it back and forth from the GPU takes time, so it's something to avoid if possible.

Calling predict is a very expensive operation that should not have to be performed again, even just one extra time, to calculate metrics when predictions have already been made and are merely inaccessible for architectural reasons.

For my use case an extra call to predict is not feasible, because the data are enormous and the additional predict would bloat training time considerably.

I think a good solution would be to somehow expose the latest predictions obtained by the model during training. If they were accessible from the model interface, you could write a callback to perform whatever "fancy" (non-graph, Python-based) metric calculations you like, without having to perform an expensive predict.

Maybe if I can figure out how to do this I will submit a PR, but I can't imagine it would be that difficult to expose.

Well... if we had something like model.add_metric(tensor), then tensorflow.py_func would be an option for folks wanting to use third-party libs to compute metrics.

It's somewhat limited, but it should cover 90% of the use cases, I guess.

I still think that unnecessarily exposing the last batch predictions is a bad idea, but that's only my 2 cents.
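
For illustration, a rough sketch of the py_func idea, assuming a TF 1.x backend where tf.py_func is available (the function names here are illustrative):

```python
import tensorflow as tf
from sklearn.metrics import average_precision_score

def np_average_precision(y_true, y_pred):
    # Plain numpy/sklearn computation, executed outside the graph.
    return average_precision_score(y_true, y_pred).astype('float32')

def average_precision_metric(y_true, y_pred):
    # Wrap the Python function so it can be used where Keras expects a tensor.
    return tf.py_func(np_average_precision, [y_true, y_pred], tf.float32)
```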

When this PR is merged, you can access the output by asking for it as an extra fetch using a Callback.

I put in an issue to track this from TF repo so they can be aware for their own adaptation of Keras: https://github.com/tensorflow/tensorflow/issues/21174

While this PR might help with some problems, I think fredtcaroli's suggestion might work best for a larger set of cases. I also needed to compute a custom accuracy during training, but it requires extra information (not just the pred/true tensors but, as in the image_ocr example, a dictionary for decoding purposes).
Sending all that information into the TF session is rather pointless, and it makes things really hard for beginners.

Although I am a beginner, evaluating metrics outside of graph execution might be the best choice.
Edit: at least optionally; it's obviously better to keep them in the graph for computation cost reasons.

And if extra tensors are needed to evaluate such a metric, they should be mapped so that Keras can extract their values out of the session and pass them to whatever performs the metric evaluation.

Btw, I came across several related questions/problems while searching; most solutions are just workarounds for a specific problem and not applicable to others. So this issue might need a little more attention on the design side.

Is there a solution for this yet? I'd also love to calculate a custom metric outside the computation graph.

Any update on this?
I would love a Callback that gives access to the predictions.
Having access to the predictions already made (train or validation) would avoid recomputing them with .predict, which can be very costly, as @evictor pointed out.

Like many others, I would also love a way to write metrics "outside of the graph" that take numpy ndarrays as input rather than Tensors.

To add another thought to this, as a possible workaround for now: I think it would be possible to attach to the model's output tensors and then do whatever one wants with the outputs at runtime (e.g. log them), no? Or put a pass-through custom layer at the end of the model that passes outputs through undisturbed but also "logs" them however you like (file, database, etc.). It's not semantically the cleanest solution, but it's not so bad in the interim. Your custom layer could choose to log only at train time, for example, and be a no-op during inference/production.

I would have already built this for my team and open-sourced it, but I haven't had the time...
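
A rough sketch of such a pass-through layer, assuming a TF 1.x backend where tf.py_func is available (logged_predictions, _dump_batch, and _passthrough_log are illustrative names, not an existing Keras API):

```python
import tensorflow as tf
from keras.layers import Lambda

logged_predictions = []  # filled at runtime, one array per batch

def _dump_batch(batch):
    # Python-side side effect: stash (or write to disk) the batch outputs.
    logged_predictions.append(batch)
    return batch

def _passthrough_log(t):
    # Run the dump via py_func, but return the original tensor through identity
    # so that shapes and gradients are unaffected.
    dumped = tf.py_func(_dump_batch, [t], t.dtype)
    with tf.control_dependencies([dumped]):
        return tf.identity(t)

# Append as the last layer of a model to "log" its outputs undisturbed.
log_outputs = Lambda(_passthrough_log)
```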

On Tue, Jul 16, 2019 at 9:22 AM ismael-elatifi notifications@github.com
wrote:

Any update on this ?
I would love a Callback where you have access to :

  • input data
  • labels
  • predictions

Having access to the predictions already made (train or validation) would
avoid to recompute them with .predict which can be very costly like
@evictor https://github.com/evictor pointed out.

Like many others, I would also love a way to write metrics "outside of the
graph" by taking in input numpy ndarrays.

I would like to use OpenCV in my metrics and track them while I run the graph.
It is unnecessary to force users to use callbacks for this. There is no benefit at all to having metrics in the graph; plain Python is usually fine and fast enough for metrics.
