Ignite: TrainsLogger example + tutorial

Created on 21 May 2020 · 13 Comments · Source: pytorch/ignite

🚀 Feature - Add TrainsLogger Example + Tutorial

I think it makes sense to add a TrainsLogger example, something similar to what we already have in the Pascal VOC example.
From what I understand, what is missing is a few entries in common_training and maybe an additional example script.

Then a quick how-to tutorial, including some references to setting up the trains-server / configuration, maybe similar to this one.

What do you think?


bmartinn

👍1

All 13 comments

Thanks @bmartinn for this suggestion 😊

Would you be interested in contributing this?

sdesrozis on 21 May 2020

@sdesrozis, Of course with pleasure :)

Did I miss anything, or should I go ahead and start diving into it?

bmartinn on 21 May 2020

@bmartinn good 😊

I think that you have caught the idea of how loggers are plugged into this example 👍🏻 IMO you can go forward 😊

@vfdev-5 agreed ?

sdesrozis on 21 May 2020

👍1

@bmartinn yes, that's correct, we need:

- a specific script for Trains here: https://github.com/pytorch/ignite/tree/master/examples/references/segmentation/pascal_voc2012/code/scripts . For example, training_with_trains.py or another name if you have a better idea :)
- a note on how to use it with Trains in a reproducible manner
- templates or usable files for Trains-Agent etc., like here: https://github.com/pytorch/ignite/tree/master/examples/references/segmentation/pascal_voc2012/experiments
  - for MLflow we provide an MLproject that the user can run as is.
  - for Polyaxon, only template files are possible, as some parts of them describe the user's infrastructure.
  - for Trains I suppose it should be something similar?

vfdev-5 on 21 May 2020

Thanks @vfdev-5
All is well, with the exception of the templates: I don't think you need anything in addition to the code to make it reproducible :)

It's basically an out-of-the-box experience: once you run the code, it creates its own "template" in the trains-server. From that point, you can clone/enqueue it for execution and you will get the exact same run (installed by the trains-agent). The automatically created "template" already contains the packages you need and specifies the recommended python version.

Do you think we should include the trains-agent installation as part of the readme?
I guess if we have reproducibility in mind then we should; I just want to make sure it is clear that this is an option and not a must. Makes sense?

bmartinn on 21 May 2020
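
As a rough illustration of what such an automatically created "template" could record, the sketch below captures the running Python version and the installed packages pinned to exact versions, using only the standard library (Python 3.8+). This is not how Trains implements it; the function name snapshot_environment and the shape of the returned dict are assumptions made for the example.

```python
import sys
from importlib import metadata


def snapshot_environment():
    """Return the Python version and the installed packages pinned to exact versions."""
    python_version = "{}.{}".format(sys.version_info.major, sys.version_info.minor)
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installs that expose no package name
            packages[name] = dist.version
    # One "name==version" line per package, as in a requirements.txt file.
    requirements = sorted(
        "{}=={}".format(name, version) for name, version in packages.items()
    )
    return {"python": python_version, "requirements": requirements}
```

Calling snapshot_environment() at the start of a run and storing the result next to the training logs gives a record that can later be used to rebuild the same environment.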

@bmartinn thanks for the details !

> Do you think we should include the trains-agent installation as part of the readme?

We can just put a link to your documentation :)

> It's basically out-of-the-box experience, once you run the code it creates its own "template" in the trains-server from that point

I think we need to give all the details about that, so that this automatic system is transparent to users and they can easily reproduce the trainings with fixed versions, update versions if needed, etc.

@sdesrozis I think we also need to make some updates to our code, especially here:
https://github.com/pytorch/ignite/blob/0452e4199ae0da627274f563a901493d9654b80d/examples/references/segmentation/pascal_voc2012/code/scripts/common_training.py#L23
We should get rid of the args with_mlflow_logging=False, with_plx_logging=False and refactor the creation of loggers...

vfdev-5 on 21 May 2020

👍1
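
One possible shape for the logger refactoring mentioned above, sketched under assumptions: instead of one boolean flag per backend, the training function takes a list of backend names and looks each one up in a registry of factories. Everything here is hypothetical (RecordingLogger, LOGGER_FACTORIES, create_loggers, and the simplified training signature); a real version would register factories that build the actual MLflow, Polyaxon, or Trains loggers.

```python
from typing import Callable, Dict, Iterable, List


class RecordingLogger:
    """Stand-in for a real experiment logger; it only remembers what it was given."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: List[dict] = []

    def log(self, step: int, metrics: dict) -> None:
        self.records.append({"step": step, **metrics})


# Hypothetical registry: one factory per backend instead of one boolean flag per backend.
LOGGER_FACTORIES: Dict[str, Callable[[], RecordingLogger]] = {
    "mlflow": lambda: RecordingLogger("mlflow"),
    "polyaxon": lambda: RecordingLogger("polyaxon"),
    "trains": lambda: RecordingLogger("trains"),
}


def create_loggers(names: Iterable[str]) -> List[RecordingLogger]:
    """Build the requested loggers, failing early on an unknown backend name."""
    loggers = []
    for name in names:
        if name not in LOGGER_FACTORIES:
            raise ValueError("Unknown logger backend: {!r}".format(name))
        loggers.append(LOGGER_FACTORIES[name]())
    return loggers


def training(config: dict, loggers: Iterable[str] = ()) -> List[RecordingLogger]:
    """Toy training loop that reports one metric per epoch to every active logger."""
    active = create_loggers(loggers)
    for epoch in range(config["num_epochs"]):
        for logger in active:
            logger.log(step=epoch, metrics={"loss": 1.0 / (epoch + 1)})
    return active
```

With this layout, adding a new backend means adding one registry entry, and the training signature stays unchanged, e.g. training({"num_epochs": 2}, loggers=["trains"]).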

@vfdev-5

> I think we need to give all the details about that, so that this automatic system is transparent to users and they can easily reproduce the trainings with fixed versions, update versions if needed, etc.

Yes, I think you are right. I'll make sure the documentation (i.e. readme) describes the mechanism so it's less magic, more automation :)

bmartinn on 21 May 2020

👍1

@vfdev-5 quick question: it seems that there is no way to change the "config" values after calling config.setup(). Well, actually you can change them, but it has no effect. For example, changing the batch_size will not actually control the batch size, as the function that builds the data pipeline was already constructed in config.py.
Is this correct? Am I missing something here?

bmartinn on 26 May 2020

@bmartinn yes, your understanding is correct. Otherwise, I think it would be rather difficult to trace all modifications for reproducible training.

vfdev-5 on 26 May 2020

👍1
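
The behaviour discussed above can be reproduced with a toy example. The sketch below is not the config system used in the Pascal VOC example; the Config class and its setup() method are hypothetical and only mimic the pattern where the data pipeline is built once, so later changes to batch_size are never read again.

```python
class Config:
    """Toy config whose data pipeline is built once, inside setup()."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.train_batches = None

    def setup(self, dataset):
        # The batch size is read here, once; the resulting batches are stored as-is.
        size = self.batch_size
        self.train_batches = [
            dataset[i:i + size] for i in range(0, len(dataset), size)
        ]
        return self


config = Config(batch_size=4).setup(list(range(8)))
config.batch_size = 2  # changed after setup(): the stored batches are untouched
print(len(config.train_batches[0]))  # 4

# To actually change the batch size, the value must be set before setup() runs.
fresh = Config(batch_size=2).setup(list(range(8)))
print(len(fresh.train_batches[0]))  # 2
```

Freezing values at setup time is what keeps a run traceable: the values that went into setup() are exactly the ones the pipeline used.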

Hi @vfdev-5, I have a draft for the PR here.
I took the liberty of adding a requirements.txt for the sample project; it seemed that the only way to get the list of packages was from the conda yaml file in the mlflow directory :)

What do you think?

bmartinn on 28 May 2020
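
For illustration, a requirements list can be derived from a conda environment description along these lines. The sketch assumes the YAML file has already been parsed into a dict (for example by a YAML library) and that conda package names match their PyPI names, which is not always true. The function name conda_env_to_requirements and the sample package list are made up for the example.

```python
import re

# Matches conda's "name=version" or "name=version=build" pins only.
_CONDA_PIN = re.compile(r"^([A-Za-z0-9_.\-]+)=([^=<>!]+)(=.*)?$")


def conda_env_to_requirements(env):
    """Turn the "dependencies" of a parsed conda environment into pip-style lines."""
    requirements = []
    for dep in env.get("dependencies", []):
        if isinstance(dep, dict):
            # The nested "pip:" section already uses pip syntax.
            requirements.extend(dep.get("pip", []))
            continue
        name = re.split(r"[=<>! ]", dep, maxsplit=1)[0]
        if name in ("python", "pip"):
            continue  # the interpreter and the installer are not pip requirements
        pin = _CONDA_PIN.match(dep)
        # Convert a conda pin to "name==version"; keep any other spec verbatim.
        requirements.append("{}=={}".format(name, pin.group(2)) if pin else dep)
    return requirements


# Hypothetical environment, shaped like a parsed conda YAML file.
example_env = {
    "dependencies": [
        "python=3.7",
        "pip",
        "numpy=1.18.1",
        "tqdm",
        {"pip": ["mlflow", "trains>=0.15"]},
    ]
}
print(conda_env_to_requirements(example_env))
# ['numpy==1.18.1', 'tqdm', 'mlflow', 'trains>=0.15']
```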

@bmartinn looks good! I think I'll need to follow your readme to re-run the training and check how it works. There are some points to fix, but it seems OK.
About requirements.txt: in some sense, yes, maybe it would be better to create a common list of them. The problem was with Polyaxon: when we built the training docker image, the requirements file was not yet available or something (maybe that has changed now).

I was also thinking about setting up a Trains server for ignite and making it available read-only so that users can see training logs online. Probably, I would need some support from you guys, as locally my docker-compose fails to bring up some of the docker images (permanent restart). What would be the best communication channel for that: your Slack, email, GitHub?

vfdev-5 on 28 May 2020

@vfdev-5 just a quick FYI: for the smoothest experience, I suggest waiting for the next RC (should be available some time next week). We added some improvements that were also needed here #1056