The current implementation of metrics assumes a very specific set of return values from the training/validation function: (y_pred, y), as can be seen for example in https://github.com/pytorch/ignite/blob/master/ignite/metrics/mean_absolute_error.py#L20 .
This has a couple inconvenient consequences:
state.outputsIt would be reasonable to standardize the set of return values expected from training/validation functions if the user would have the option of putting custom values in the calling engine's state. This would require exposing state to the training/validation functions. This would solve (2) but metrics would need to be further adjusted to solve (1).
One way of standardizing the return values would be to always return only loss, model outputs, and targets. If state is ephemeral but is passed to the training/validation function, this would be of the form:
return loss, model_outputs, targets, state
If state were to be exposed in an Engine as described in #117, this would be of the form:
return loss, model_outputs, targets
Every element in model_outputs would be paired with each element in the targets. This would allow a Metric to optionally track a single output in a multi-output model, changing
class Metric(object):
def __init__(self):
self.reset()
def iteration_completed(self, engine, state):
self.update(state.output)
to
class Metric(object):
def __init__(self, output_index=None):
self.output_index = output_index
self.reset()
def iteration_completed(self, engine, state):
output = state.output
target = state.target
if self.output_index is not None:
output = output[self.output_index]
target = target[self.output_index]
self.update(output, target)
I would say that in this special case of computing a metric on a particular index of the output/target, you can write a custom class, sort of proxy metric, IndexedMetric (find a better name):
class ProxyMetric(Metric):
def __init__(self, metric):
self.metric = metric
super(IndexedMetric, self).__init__()
def reset(self):
self.metric.reset()
def compute(self):
return self.metric.compute()
class IndexedMetric(ProxyMetric):
def __init__(self, metric, index):
self.index = index
super(IndexedMetric, self).__init__(metric)
def update(self, output):
y_pred, y = output
y_pred = y_pred[self.index]
y = y[self.index]
self.metric.update((y_pred, y))
HTH
Yes, one could write a custom wrapper in some cases and custom metrics in other cases; however it would be preferable to have the provided metrics be commonly applicable. Having more than one model output is not a very special case. The metrics code also strongly restricts what can go in the return values of an update function - so let's discuss what should go there.
One of the points I'm trying to put forth is that now that state is tracked in an engine, all custom results of an update function could be stored in the state while the return values of the update function could be standardized. This would make any code that depends on these return values, such as metrics code, be possible to implement.
Already, the metrics code makes a strong assumption about the form of the return values. Let's determine a good, flexible form for update function return values, then -- and allow everything else to be put in the state.
I'm also having issues with this very restricted output.
I usually like to return model outputs in a dict, but that doesn't work with current Metric/Loss implementation.
From this discussion, I think it makes sense to pass state into the update functions so users can store (and manage) whatever state they want on that object. This would also remove the need for returning arbitrary things from these functions.
As for a standardized output, I think these would look different for the training update function and the evaluate function (i.e. different for each engine). The main purpose of the Evaluator is to evaluate a dataset against Metrics so it makes sense that we standardize the output of the evaluation functions to be possibly y, y_pred which is the form of the pytorch nn loss functions.
For the trainer's update function, I think the return value is not important, but it's important to note that it will be stored in state.output for the last iteration.
How does that sound @veugene @versatran01?
Also, @veugene - how do you currently (outside of ignite) evaluate your model outputs using pytorch's loss functions?
Or there could be a user-supplied handler that decodes whatever input to the desired form for Metric to process. It will default to y_pred, y = output.
it makes sense that we standardize the output of the evaluation functions to be possibly
y, y_predwhich is the form of the pytorch nn loss functions.
The pytorch loss functions are written for an input of the form y, y_pred but not necessarily state.output; that is, it's the job of the meta-framework to flexibly match common patterns (models with one or more outputs) to the framework's functions (pytorch's per-output loss functions).
As for a standardized output, I think these would look different for the training update function and the evaluate function (i.e. different for each engine).
As long as the user can put what they need in state, that's fine. I would find it conceptually simpler though if both update functions were to have the same return pattern.
For the trainer's update function, I think the return value is not important, but it's important to note that it will be stored in state.output for the last iteration.
I always compute metrics on both the training data and validation data, to get a sense of how the training is progressing without being limited to the sometimes very vague overall loss. This is also the common pattern in other frameworks and github repos that I've seen. I think that Metric should be usable just as easily with Trainer as with Evaluator, in the same way.
@veugene You bring up some great points. I definitely agree that we can make a few changes here to make things easier for everyone. Couple things...
On your last point... This is a failing on my part for not getting the docs together yet for this. The ignite approach to this is to use an Evaluator for both your training data and your validation data. The Trainer is for training and the Evaluator is for evaluating. But there is no reason you can't evaluate your model against a training set. If you take a look at the original metrics PR, it has an example of just this. Does that address that concern?
I think your point of having access to state within the update function is a good one. I'm convinced. We just need to come to an agreement for how it gets passed in there. So assuming we have access to state in the update function where the user can put whatever they want on it, how would you feel if we kept the current return signatures from update? Keeping in mind that this is only for supervised models, I think assuming a return of y, y_pred is reasonable. However, we do need to address your point of multiple outputs. My preference (hattip to @versatran01) would be to have an optional map_fn that can be passed to all supervised metrics at instantiation time. This would default to just an identity operation but could easily be made to work for indexed outputs and dictionary outputs.
What do you think?
hmm, the problem with running the evaluator over the training set to get metrics is that you have to iterate over your training set again (this can be expensive for large datasets).
If this is the only way to collect training metrics then I guess we should revisit this.
However, if we accept metrics to the Evaluator and the Trainer then there really isn't much distinction between the two, and perhaps we should just merge them into an Engine? The only difference being the process_fn that is run in the inner loop - for which the user control.
Thoughts?
@alykhantejani The way I do this is to take a sample of the training set (that is the same size) as the evaluation set. It's the only way to compare apples to apples. Otherwise your training metrics are computed against a moving target (with changing params) while your evaluation metrics are computed against a static target.
I think the confusion here will be cleared up quite a bit once we have the docs up.
@veugene can you share an example of the metrics you are calculating when training?
@alykhantejani This is actually something that I've wanted to do as well. A (relatively) simple example would be a multi-head classification model. In that case y and y_pred could be a tuple where the first element is the label for the first classification problem and the second element is the label for the second classification problem.
Another example could be a multi-headed model where you are doing both a bounding-box regression and classification.
@jasonkriss - thanks. I think the case for multiple outputs makes a ton of sense, we should update the metrics to account for this, possibly with an output_transform which the state.transform is passed to (similar to what was in the old History code).
My main concern is metrics on the trainer:
Trainer and Evaluator or just have a single Engine?@alykhantejani I'm computing a Dice loss, evaluated per batch as if each batch is its own volume (this is better than evaluating per element in the batch as the latter has too high a variance). I am also accumulating the counts necessary to compute a dataset-wide Dice score by the end of the epoch. This second measure is like the Metric code that was recently introduced.
@jasonkriss @alykhantejani It is very common to accumulate measures over a training epoch even as the model is changing. This overall loss is typically reported as an average over all minibatches since the start of an epoch. Other measures are reported similarly. While this is not ideal for evaluating model performance, it is still informative; model performance is evaluated on the validation set, anyway.
As @alykhantejani points out, running an Evaluator on the training set requires a second pass through the data. I would never do this as it's too expensive on large datasets. The advantage to accumulating measures over an epoch in the training loop is that this is done online with minimal overhead. This is likely why this compromise is such a common pattern.
is there really a need to separate Trainer and Evaluator or just have a single Engine?
I guess you're right that if these were to be made to have the same input and output formats, they could be merged. In that case, the user would still have the option of running this merged Engine as a metric "Evaluator" on the training data in eval() mode as in the pattern presented by @jasonkriss.
@jasonkriss Yes, a mapper function may be a way to go about it. Since with access to engine state, a user could store custom values in the state, would there be cases where it is too restrictive to have a standard pattern defined for the return values of update functions? If so, the mapper function would be the more flexible solution; else, standardizing the return value pattern would appear simpler.
@veugene I think we want to standardize the return value to y, y_pred but at the same time allow flexibility as to what y and y_pred actually are (scalars, dict, tuple, etc). So for that reason, I think a mapper/transform is the way to go. I opened #121 to address this. We can discuss it more over there if need be.
So I think the only thing left to hammer out on this issue is the Trainer/Engine division... I know that computing metrics against the training set in an online fashion is a very common pattern so it may be something we need to support. It's certainly a signal. But it is a noisy one. I wouldn't want to give up the approach I described. IMO it works very well to run a true evaluation pass on a subset of your training set. The exact same way you would do so on your validation set. As long as you are using a sample and not the whole training set, this cost is pretty negligible (with model set to eval and using larger batch sizes).
I doubt I'll be able to convince everyone of this so maybe we add support for it. However, I'm not yet convinced we should give up the Trainer/Evaluator split.
@veugene thanks for the clarification of your use-case.
We can still keep Trainer and Evaluator separate if we feel it will give cleaner code (plus flexibility in the future to do trianer/evaluator specific things). But if/when (it seems like we should) add metrics to the Trainer the only difference is that the Evaluator only runs one pass of the dataset and does not emit EPOCH_STARTED and EPOCH_COMPLETED events.
Happy to hear pros and cons of keeping the two classes over merging them into a single Engine?
Also, thanks for opening #121 @jasonkriss - let's move discussions related to the metrics API there.
I think we want to standardize the return value to y, y_pred but at the same time allow flexibility as to what y and y_pred actually are (scalars, dict, tuple, etc).
Alright -- the flexibility to use dictionaries for labeled outputs is indeed compelling.
I opened #121 to address this.
Thanks :)
So I think the only thing left to hammer out on this issue is the Trainer/Engine division...
Currently, to make the evaluator go over a subset of the training data, one still needs to set up an appropriate data iterator that presents some random subset. Since Evalutor doesn't handle this itself, while this strategy could be a good one, I think it could be achieved with about the same code burden regardless of whether Evaluator and Trainer are merged. Even if they are merged, a clear set of examples in the documentation, recommending this strategy as an option to users, would go a long way.
Evaluator only runs one pass of the dataset and does not emit
EPOCH_STARTEDandEPOCH_COMPLETEDevents.
Yes, that is a key difference. However, a Trainer-like unified Engine could be set to have max_epochs=1 by default. So then, would emitting EPOCH_STARTED and EPOCH_COMPLETED be problematic on an evaluation run?
Also yes, I too would love to see what may be potential differences between Evaluator and Trainer that may be implemented later.
@alykhantejani @veugene Thinking it over a bit more, I think you guys may actually be right. I was thinking it'd be best to keep the separation mainly from the perspective of metrics and the events we attach them to. However, If we were to unify them into a single engine, we could then update this to use EPOCH_STARTED and this to use EPOCH_COMPLETED. Metrics would then work for both training and evaluating and the code would be simplified quite a bit.
A "Trainer" would then be an Engine with a "training" update function (and likely max_epochs > 1) and an Evaluator would simply be an Engine with an "evaluation" update function (and the default max_epochs=1. I don't think there is a problem with the EPOCH_STARTED and EPOCH_COMPLETED events being emitted by an "evaluator".
I actually think this works out quite well. In that case, I'd probably be in favor of unifying them into a single engine and then if at some point down the line there is a compelling use case to separate them again, we can. But in the meantime, having a single Engine would be nice and simple and easier from user-education standpoint.
What do you all think?
@jasonkriss @alykhantejani Yes I like that
Yes, Trainer and Evaluator can be more factorized in this way that there is an Engine that can run multiple times some processing on the data.
Are we also missing a Predictor ?
@vfdev-5 I suppose the main loop of a Predictor could be an Engine whose "update function" does not compute the loss but simply returns outputs, computed with no_grad() in eval() mode. What remains is collecting the outputs. Outputs could then be accumulated by a handler from state (possibly with an optional function for mapping outputs as in #121).
Then, if there were to be a Predictor alias, it would be just a convenience function that creates an Engine with such a handler.
I agree with the statements above, let's go ahead and merge Trainer and Evaluator into one and make metrics work for the Engine :)
Thanks for starting the fruitful discussions @veugene !
Metrics now support multiple outputs with #121 being merged.
closing as Trainer and Evaluator have also been merged in #122
Having a predictor would be nice. it could be something pretty easy but would be worth for being complete:
def create_predictor(model,
prepare_batch=_prepare_batch):
def _inference(engine, batch):
model.eval()
with torch.no_grad():
x = prepare_batch(batch)
y_pred = model(x)
return y_pred
engine = Engine(_inference)
for name, metric in metrics.items():
metric.attach(engine, name)
return engine
@zippeurfou yes, something like that but without metrics:
def _prepare_prediction_batch(batch):
x, ids = batch
return (convert_tensor(x, device=device, non_blocking=non_blocking), ids)
def create_predictor(model, device=None, non_blocking=False,
prepare_batch=_prepare_prediction_batch):
if device:
model.to(device)
def _inference(engine, batch):
model.eval()
with torch.no_grad():
x, ids = prepare_batch(batch, device=device, non_blocking=non_blocking)
y_pred = model(x)
return y_pred, ids
engine = Engine(_inference)
return engine
Anyway, please create an issue for that, if you think that can be useful in the library and we could discuss.
Most helpful comment
I agree with the statements above, let's go ahead and merge
TrainerandEvaluatorinto one and make metrics work for theEngine:)Thanks for starting the fruitful discussions @veugene !