Ignite: TrainsSaver doesn't respect Checkpoint's n_saved

Created on 20 May 2020 · 34Comments · Source: pytorch/ignite

🐛 Bug description

As the title says, it seems that TrainsSaver bypasses the Checkpoint n_saved parameter. That means that all models are saved and never updated / deleted.

Consider this simple example:

        task.phases['train'].add_event_handler(
            Events.EPOCH_COMPLETED(every=1),
            Checkpoint(to_save, TrainsSaver(output_uri=output_uri), 'epoch', n_saved=1,
                       global_step_transform=global_step_from_engine(task.phases['train'])))

The above saves every checkpoint. You end-up with

epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
...

Now if we do, the same with DiskSaver:

        task.phases['train'].add_event_handler(
            Events.EPOCH_COMPLETED(every=1),
            Checkpoint(to_save, DiskSaver(dirname=dirname), 'epoch', n_saved=1,
                       global_step_transform=global_step_from_engine(task.phases['train'])))

We get only:

epoch_checkpoint_3.pt

as expected.

Same behaviour if we save only best models using score_function, i.e. TrainsSaver saves every best model.

Environment

PyTorch Version: 1.3.1
Ignite Version: 0.4.0.dev20200519 (EDIT: update to latest nightly, issue still exists)
OS: Linux
How you installed Ignite: pip nightly
Python version: 3.6
Any other relevant information: trains version: 0.14.3

bug

Source

achigeor

Most helpful comment

Thanks @vfdev-5 ! kudos for the quick PR!
We will add it into TrainsServer (probably needs to add a bit of support in Trains as well)
I'll update here once the PR is ready

bmartinn on 23 May 2020

👍3

All 34 comments

Thanks for the report @achigeor ! Let us reproduce the issue and have deeper look on where is the problem.

cc @jkhenning

vfdev-5 on 20 May 2020

👍1

Checked the number of temporary saved checkpoints and only one remains in temp folder. Probably, the problem is that we need to explicitly remove them from Trains' storage: override
remove method
https://github.com/pytorch/ignite/blob/0452e4199ae0da627274f563a901493d9654b80d/ignite/handlers/checkpoint.py#L370-L372

@jkhenning do you have an API to remove stored artifact ?

vfdev-5 on 20 May 2020

Hi @achigeor ,

When you say you see:

epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
...

Do you mean you see them in the Trains UI models page, or somewhere else? As @vfdev-5 said, I believe these files will not be present in the dirname folder (or the temp folder used if dirname is not provided)

jkhenning on 20 May 2020

Hey @jkhenning,

I mean I see them in the directory I pass as output_uri. E.g. if you pass output_uri="checkpoints", you will see these checkpoints in checkpoints/trains_project_name/trains_task_name.task_id/models/:

|   +-- <project-name>
|       +-- <task-name>.<task-Id>
|           +-- models
|               +-- <epoch_checkpoint_1.pt>
|               +-- <epoch_checkpoint_2.pt>
|               +-- <epoch_checkpoint_3.pt>

In the Trains UI, you would only see the _last_ model model saved in artifacts/Output Model, i.e task_name-epoch_checkpoint_3.pt.

EDIT: I also checked the temp folder. It indeed only keeps the last one. However, the problem is that trains _uploads_ all the checkpoints in the output_uri location defined - in this case a local directory.

Following is a fully reproducible example, based on the mnist example:

from argparse import ArgumentParser

import torch
import torch.nn.functional as F
from torch import nn
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import Compose, ToTensor

from ignite.contrib.handlers.trains_logger import *
from ignite.engine import Events, create_supervised_trainer
from ignite.handlers import Checkpoint, DiskSaver

LOG_INTERVAL = 10


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=-1)


def get_data_loaders(train_batch_size):
    data_transform = Compose([ToTensor()])

    train_loader = DataLoader(
        MNIST(download=True, root=".", transform=data_transform, train=True), batch_size=train_batch_size, shuffle=True
    )

    return train_loader


def run(train_batch_size, epochs, lr=0.01):
    train_loader = get_data_loaders(train_batch_size)
    model = Net()
    device = "cpu"

    if torch.cuda.is_available():
        device = "cuda"

    model.to(device)  # Move model before creating optimizer
    optimizer = SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    trainer = create_supervised_trainer(model, optimizer, criterion, device=device)

    TrainsLogger(project_name="trains-ignite", task_name="test-checkpoints")

    handler = Checkpoint(
        {"model": model},
        TrainsSaver(output_uri="test_checkpoints"),
        # DiskSaver(dirname="test_checkpoints", require_empty=False),
        n_saved=1,
        filename_prefix="epoch",
        global_step_transform=global_step_from_engine(trainer),
    )

    trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)
    trainer.run(train_loader, max_epochs=epochs)


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=32, help="input batch size for training (default: 64)")
    parser.add_argument("--epochs", type=int, default=3, help="number of epochs to train (default: 10)")

    args = parser.parse_args()

    run(args.batch_size, args.epochs)

I just removed the unnecessary stuff, focusing on the checkpoints.

You can comment TrainsSaver or DiskSaver out, here:

    handler = Checkpoint(
        {"model": model},
        TrainsSaver(output_uri="test_checkpoints"),
        # DiskSaver(dirname="test_checkpoints", require_empty=False),
        n_saved=1,
        filename_prefix="epoch",
        global_step_transform=global_step_from_engine(trainer),
    )

Let me know if that helps. I just run the above and reproduced the behaviour I explained on the first message.

achigeor on 21 May 2020

Hey @achigeor

I see what you mean, and you're quite correct about Trains "uploading" the model files to the output_uri which in this case is a folder.
The thing is, output_uri is also (most commonly) used to represent remote storage services, like an S3 bucket, in which cases supporting a delete action might add additional constraints (requiring the credentials to allow these permissions) or introduce a considerable amount of risk (automated processes with delete permission for buckets containing "production" grade data) - for us, the compromise there was to take into account the relatively low cost of cloud storage and forego support for delete in this case.

An approach we did take in other cases, which is much more conservative, is reusing resource names, so if for example you'd like no more than the 3 last checkpoints to be stored, we can cycle through uploaded checkpoint file names, and upload as follows:

epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
...

As you can see, checkpoint 4 will simply overwrite old checkpoint 1 (and checkpoint 5 will override checkpoint 2), and assuming you're listing the files in modified order (or viewing through the Trains UI) - you still get the last 3 checkpoints, in the correct order (luckily, write access also mean overwrite access :)

This can be easily implemented in this case in the TrainsSaver itself (with a little help from Trains).

What do you say?

jkhenning on 21 May 2020

An approach we did take in other cases, which is much more conservative, is reusing resource names, so if for example you'd like no more than the 3 last checkpoints to be stored, we can cycle through uploaded checkpoint file names

@jkhenning maybe this approach is not suitable for epoch_checkpoint_<N>.pt or best model file best_model_val_score=<0.XYZ>.pt where <N> indicates training epoch and <0.XYZ> best validation score. We can not cycle on those values, otherwise they do not make sense...

About output_uri and dirname args of TrainSaver shouldn't we limit to only one provided possible (in case of when output_uri is a local storage) ?

If I understand correctly in case of @achigeor the checkpoints are stored twice: at temp dirname and at output_uri local storage ?

vfdev-5 on 21 May 2020

@vfdev-5 if I may chime in.

We are using the filename as a way to add meta-data on the model weight itself.

Maybe the best usage for TrainsLogger would be to store that meta-data as a comment on the model, this way it is also searchable? (we could also use the model name in the Trains system and store the meta-data there)
Another option is using the model configuration entry (that can store anything, and add the meta-data there) see example

This removes the need to preserve the exact "filename" in the system, as the way to look for it would be to go to the Trains UI and search for the Model with specific score or checkpoint. Obviously you can also get it from a previously executed Task, including meta-data.

What do think?

p.s.
We are thinking on adding a more specific key/value to the model object, so one could add more diverse information.

bmartinn on 21 May 2020

Hey @vfdev-5

Well, I agree it makes little sense to store twice, however I'm not sure detecting when output_uri is local and when not is a good approach (from experience, likely to break on edge-cases).

I would say we should instruct users to use dirname for local storage, and output_uri for remote storage, but this means users who'd like to test their code with output_uri before switching to a remote storage will still see the same side-effects @achigeor noticed...

jkhenning on 21 May 2020

@bmartinn thanks for the idea ! If I understand correctly your suggestions is to :

pick filename which is need to be stored, e.g. epoch_checkpoint_1.pt and setup some meta-data
save checkpoint on Trains system with another name that we can rewrite in a cycle manner as proposed by @jkhenning
=> on Trains UI we have a good number of stored artifacts and their meta-data contain the name we intendeded to store ?

@jkhenning, could you please detail why we need both arguments actually ? Sorry, I become a bit lost with how it work with Trains...
Naively, can't we keep one argument, e.g. output_uri?

vfdev-5 on 21 May 2020

@vfdev-5 yes exactly!
a minor detail though:

on Trains UI we have a good number of stored artifacts and their meta-data contain the name we intended to store ?

Trains separates artifacts from Models. Basically artifacts you store once, and are tied with an experiment, where as Models you store multiple times (e.g. snapshots) and have their own entity. This means we can easily track models between experiments. Unfortunately Artifacts already received the meta-data feature, and Models are still missing it, but I'm hoping it is coming soon ;)

To the point, models have names that are unrelated to the experiment that created them, (kind of like Artifacts). We can basically use these names as the meta-data. Notice that a user can always choose to rename a model, even after the experiment is completed, with no effect on the actual stored file. This gives you the possibility of renaming your model (in the UI) from "best_model_val_score=<0.XYZ>.pt" to "this is the chosen one! score 0.999.pt"

bmartinn on 21 May 2020

👍1

Hi all,

What's not clear to me regarding output_uri is what @jkhenning mentioned about remote saving. We will indeed use S3 to save our models and artifacts - we're now finalizing integrating Trains in our workflow and were using a local dir as output_uri for testing.
But from my understanding, trains would behave the same regardless, i.e, what I'm experiencing locally will still happen if I store in S3. Is there a mechanism that I miss that makes this behaviour relevant only for local storing?

To give some perspective, my goal is to store a checkpoint every N epochs keeping the last K (as kind of a "backup") as well as the best model(s). With DiskSaver this is straightforward to do.

One solution could be to name the models checkpoint and best_model as @jkhenning suggested, but this won't work as @vfdev-5 pointed out, since afaik ignite adds a relevant suffix under the hood and no option to disable it yourself (please @vfdev-5 correct me if I'm wrong on this one). If this can work, it's fine for my usecase.

Regarding trains and output models I like the idea of adding metadata to them, especially since it will allow to programmatically query based on them (instead of having to do torch.load() to read from the checkpoint dictionary).

achigeor on 21 May 2020

One solution could be to name the models checkpoint and best_model as @jkhenning suggested, but this won't work as @vfdev-5 pointed out, since afaik ignite adds a relevant suffix under the hood and no option to disable it yourself (please @vfdev-5 correct me if I'm wrong on this one). If this can work, it's fine for my usecase.

@achigeor you are right, there is no easy way to store checkpoints without suffix. Maybe we can create a feature request for that inside Checkpoint. However, meta-data (e.g. epoch index or score) may be missing in the .pt file.

vfdev-5 on 21 May 2020

Hey @achigeor

But from my understanding, trains would behave the same regardless, i.e, what I'm experiencing locally will still happen if I store in S3. Is there a mechanism that I miss that makes this behaviour relevant only for local storing?

You're absolutely right - that was my point, that Trains currently handles remote and local storage locations exactly the same, without using any delete-like feature.

@vfdev-5

@jkhenning, could you please detail why we need both arguments actually ? Sorry, I become a bit lost with how it work with Trains...
Naively, can't we keep one argument, e.g. output_uri?

Well, the original intent of keeping dirnamewas to keep part of the DiskSaver interface since we need a local storage where checkpoint are stored prior to being uploaded anyway (if you remember, we create a temp dir in case dirname was not provided). I agree this represents some kind of duplication since output_uri (which is essential since it allows the user to provide a remote storage for Trains to use) also supports local storage.
So I guess the _actual_ original intent was _"Use dirname for local storage and output_uri for remote storage. If you don't use dirname, we'll use a temp dir instead"_ - but maybe that was not reflected very well in the documentation, and it's possible it's too complicated and we should just decide we'll drop dirname altogether and always use a temp dir...

jkhenning on 21 May 2020

Hey @achigeor

But from my understanding, trains would behave the same regardless, i.e, what I'm experiencing locally will still happen if I store in S3. Is there a mechanism that I miss that makes this behaviour relevant only for local storing?

You're absolutely right - that was my point, that Trains currently handles remote and local storage locations exactly the same, without using any delete-like feature.

Thanks got it now! I have this followup question then:

Is there a specific reason that uploading happens while the task is running? What I see happening now is:

Checkpoint with TrainsSaver writes to /tmp or dirname, respecting n_saved param
TrainsSaver, through trains task's output_uri uploads the model from /tmp to output_uri everytime a model is saved in the previous step

Would it be possible to instead upload the contents of the /tmp directory _after_ the task is finished? This way you upload only once, instead of everytime you save a model while the training task is running.

achigeor on 21 May 2020

Would it be possible to instead upload the contents of the /tmp directory after the task is finished?

@achigeor this can be possible, but we should keep in mind that we can loose everything if script crashes before the task is finished and we didn't catch that.

vfdev-5 on 21 May 2020

👍2

From what I understand, Checkpoint is doing the filename creation, and it seems that "_" is a hard-coded separator.

In case output_uri is passed, we could parse the filename, extract the filename_prefix/name and use that as the base filename for upload, or we can use a hard-coded fixed name.

This way if output_uri is not passed, the behavior is consistent with DiskSaver (as is already the case now), and if output_uri is passed, the Checkpoint filename (e.g. {filename_prefix}_{name}_{global_step}_{score_name}={score}.{ext}) will be used as the Trains Model name (i.e. in the UI), making it intuitive to look for.

What do you guys think?

bmartinn on 22 May 2020

In case output_uri is passed, we could parse the filename, extract the filename_prefix/name and use that as the base filename for upload, or we can use a hard-coded fixed name.

Well, both approaches are not ideal, but between parsing and hard-coded fixed name, IMO, hard-coded fixed name probably better... especially if in UI full correct filename is used.

vfdev-5 on 22 May 2020

@vfdev-5 I agree, I think hard-coded is probably safer, but we should have a way to differentiate between two models being stored, for example GAN training. We brings me back to parsing...
How about we inherit Checkpoint, that way we know what's the prefix and can easily make those decisions. What do you think?

bmartinn on 22 May 2020

👍1

I was also thinking about if we can pass more info to save_handler in addition to object_to_save and filename:
https://github.com/pytorch/ignite/blob/c012166f93e56f8e9538741f5745a5010983ba38/ignite/handlers/checkpoint.py#L21

For example, we can opt to pass some meta-info about the checkpoint to save:

class BaseSaveHandler(metaclass=ABCMeta):
    """Base class for save handlers"""

    @abstractmethod
    def __call__(self, checkpoint: Mapping, filename: str, metadata=None) -> None:
        pass

and in metadata we can pass prefix, name and all scores which compose the filename.

This certainly requires minor API change for DiskSaver and other savers. However, we recently introduced BaseSaveHandler as base class for savers, so we still can change thing now...

vfdev-5 on 22 May 2020

👍3

@vfdev-5 I like it! I think it makes a lot of sense. That means the DiskSaver can gain more insight into the model being stored (so for example, if the model score is lower then the previous, it can only store it locally and choose to skip the upload).

bmartinn on 23 May 2020

Yes, for a cloud disk savers or other more complex savers where we can set some metadata it can be helpful and avoid parsing the filename :)

so for example, if the model score is lower then the previous, it can only store it locally and choose to skip the upload

Normally, it is the role of Checkpoint with score function etc.

So, let us create a FR for that.

vfdev-5 on 23 May 2020

🚀1

@bmartinn PR with metadata is landed. So, you may try to incorporate it into TrainsSaver and give a feedback on that....

vfdev-5 on 23 May 2020

🚀1

Thanks @vfdev-5 ! kudos for the quick PR!
We will add it into TrainsServer (probably needs to add a bit of support in Trains as well)
I'll update here once the PR is ready

bmartinn on 23 May 2020

👍3

@vfdev-5 Great idea, and 👍 for pushing things so quickly - once we'll add the required support in trains I'll open a PR 😄

jkhenning on 23 May 2020

👍1

@bmartinn as 0.15.0 is out from your side, what can be done here to fix this issue ?

vfdev-5 on 4 Jun 2020

@jkhenning could you help with a PR fix?

bmartinn on 4 Jun 2020

👍1

@bmartinn, @vfdev-5 I'll get on it :)

jkhenning on 4 Jun 2020

👍2

Hi @vfdev-5,

I've pushed a draft into my fork, branch fix_trains_checkpoint_n_saved (see here) - I'd appreciate your input.

Basically, I changed the TrainsSaver behavior to use the new Trains pre/post registration/upload callbacks, which always uses the filename as passed to the saver (which removed the need to have a special case for atomic mode).

However, in order to preserve the n_saved behavior which keeps the n checkpoints with the highest priority, the logic inside the pre-save callback had to be aware of the n_saved value stored in the checkpoint, as well as the list of saved files and their priorities - this is required in order for the callback to know which previously registered/uploaded file must be overwritten. Even though I could have achieved that by inheriting the Checkpoint class, it seems a bit too much - I was wondering if you think we can simply expose these values in the Checkpoint's interface (for read, obviously, not write), or whether you have a better idea :)

jkhenning on 6 Jun 2020

@jkhenning thanks for working on that !
Could you please confirm me that overriding DiskSaver.remove method can not help for that ?
For example, TrainsSaver receives checkpoints with the filenames and metadata:

T1 : save: training_checkpoint_100.pt meta1
---
T2 : save: training_checkpoint_200.pt meta2
---
T3 : save: training_checkpoint_300.pt meta3
---
T4 : save: training_checkpoint_400.pt meta4
T4 : remove: training_checkpoint_100.pt
---
T5 : save: training_checkpoint_500.pt meta5
T5 : remove: training_checkpoint_200.pt
---
...

if you store somewhere a mapping between training_checkpoint_X.pt and the name you save n checkpoints to store on server, maybe it could be possible to override remove: training_checkpoint_X.pt and perform necessary action ?
What do you think ?

PS: I tried to reach out you on allegro slack to talk about this in more fluent way.

vfdev-5 on 6 Jun 2020

I think the reason remove might not be enough, is that it will basically be too late.
In other words if n_saved equals to 5, the first time the TrainsLogger will see a remove called will be after the 6th model was created, so essentially although you wanted the "last" 5 models, TrainsLogger will store the "last" 6.

@vfdev-5 makes sense ?

bmartinn on 6 Jun 2020

@vfdev-5 I've updated my fork's branch to match the changes we made in Trains to support this feature - we plan to release an RC after this weekend, at which point I'll open a PR :)

jkhenning on 10 Jun 2020

👍1

@jkhenning thanks for the update ! It can work for us. We have released our v0.4.0 RC and meanwhile will work on remaining tasks.

Yes, I saw the modification in your branch. So, it remains only the switch of save/remove order if we would like to have n_saved instead of n_saved+1 checkpoints.

cc @erip

vfdev-5 on 10 Jun 2020

Hi @vfdev-5

So, it remains only the switch of save/remove order if we would like to have n_saved instead of n_saved+1 checkpoints.

Indeed :)

jkhenning on 11 Jun 2020

Hi @vfdev-5

So, it remains only the switch of save/remove order if we would like to have n_saved instead of n_saved+1 checkpoints.

Indeed :)

@jkhenning done !

vfdev-5 on 12 Jun 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Saving double execution cost during training

CreateRandom · 3Comments

Add integration with tensorboard?

alykhantejani · 3Comments

Fix mypy issue with pytorch nightly

vfdev-5 · 3Comments

Label-wise metrics (Accuracy etc.) for multi-label problems

jphdotam · 4Comments

running average with for loop

Aiden-Jeon · 3Comments