I think it makes sense to add a TrainsLogger example, something similar to what we already have here in the Pascal VOC example
From what I understand, what is missing is a few entries in the common_training and maybe an additional example script here.
Then a quick how-to tutorial, including some references on setting up the trains-server / configuration, maybe similar to this one
What do you think?
Thanks @bmartinn for this suggestion 😊
Would you be interested in contributing to that?
@sdesrozis, Of course with pleasure :)
Did I miss anything? Or should I go ahead and start diving into it?
@bmartinn good 😊
I think you have caught the idea of how loggers are plugged into this example 👍🏻 IMO you can go ahead 😊
@vfdev-5 agreed?
@bmartinn yes, that's correct, we need
trains here: https://github.com/pytorch/ignite/tree/master/examples/references/segmentation/pascal_voc2012/code/scripts . For example, training_with_trains.py or another name if you have a better idea :)
Thanks @vfdev-5
All is well, with the exception of the templates, I don't think you need anything additional to the code to make it reproducible :)
It's basically an out-of-the-box experience: once you run the code, it creates its own "template" in the trains-server. From that point, you can clone/enqueue it for execution and you will get the exact same run (installed by the trains-agent). The automatically created "template" already contains the packages you need and specifies the recommended python version.
Do you think we should include the trains-agent installation as part of the readme?
I guess if we have reproducibility in mind then we should. I just want to make sure it is clear that this is an option and not a must. Does that make sense?
@bmartinn thanks for the details!
> Do you think we should include the trains-agent installation as part of the readme?
We can just put a link to your documentation :)
> It's basically an out-of-the-box experience: once you run the code, it creates its own "template" in the trains-server from that point
I think we need to give all the details about this, so that the automatic system is transparent to users and they can easily reproduce trainings with fixed versions, update versions if needed, etc.
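To make the "fixed versions" point concrete, here is a minimal sketch of recording the installed package versions as pinned requirement lines. It uses only the Python standard library and is not how trains builds its "template"; the helper name `pinned_requirements` is hypothetical.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Print something that could be saved as a requirements.txt snapshot.
    print("\n".join(pinned_requirements()))
```

Saving this output next to a training run is one simple way to let a user see, and later change, the exact versions that were used.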
@sdesrozis I think we also need to make some updates to our code, especially:
https://github.com/pytorch/ignite/blob/0452e4199ae0da627274f563a901493d9654b80d/examples/references/segmentation/pascal_voc2012/code/scripts/common_training.py#L23
Get rid of the args with_mlflow_logging=False, with_plx_logging=False and refactor the creation of loggers...
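One possible shape for that refactoring, as a rough sketch only: replace the per-logger boolean flags with a list of logger names resolved through a registry. All names here (`LOGGER_FACTORIES`, `register_logger`, `create_loggers`) are hypothetical, and the factories return stand-in tuples rather than real ignite loggers.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical registry: logger name -> factory taking the run config.
LOGGER_FACTORIES: Dict[str, Callable[[dict], object]] = {}


def register_logger(name: str):
    """Decorator that registers a logger factory under the given name."""
    def decorator(factory):
        LOGGER_FACTORIES[name] = factory
        return factory
    return decorator


@register_logger("mlflow")
def make_mlflow_logger(config: dict):
    # Stand-in object; a real factory would build the MLflow logger here.
    return ("mlflow-logger", config["task_name"])


@register_logger("trains")
def make_trains_logger(config: dict):
    # Stand-in object; a real factory would build the Trains logger here.
    return ("trains-logger", config["task_name"])


def create_loggers(names: Iterable[str], config: dict) -> List[object]:
    """Build every requested logger, failing loudly on unknown names."""
    names = list(names)
    unknown = [n for n in names if n not in LOGGER_FACTORIES]
    if unknown:
        raise ValueError(f"Unknown loggers: {unknown}")
    return [LOGGER_FACTORIES[n](config) for n in names]


loggers = create_loggers(["trains"], {"task_name": "pascal_voc2012"})
```

With this layout, adding a new logger means registering one more factory instead of adding another `with_*_logging` argument to the training function.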
@vfdev-5
> I think we need to give all the details about this, so that the automatic system is transparent to users and they can easily reproduce trainings with fixed versions, update versions if needed, etc.
Yes I think you are right, I'll make sure the documentation (i.e. readme) describes the mechanism so it's less magic more automation :)
@vfdev-5 quick question: it seems there is no way to change the "config" values after calling config.setup(). Well, actually you can change them, but it has no effect. For example, changing batch_size will not actually control the batch size, because the function that builds the data pipeline was already constructed in config.py.
Is this correct? Am I missing something here?
@bmartinn yes, your understanding is correct about it. Otherwise, I think, it will be rather difficult to trace all modifications for reproducible training.
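The behaviour discussed above can be illustrated with a small standalone sketch. The `Config` class below is hypothetical and is not the example's config.py; it only shows why a value read at setup time is unaffected by later assignments.

```python
class Config:
    """Hypothetical config whose data pipeline is built once, in setup()."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.get_batches = None

    def setup(self):
        batch_size = self.batch_size  # the value is read once, here

        def get_batches(data):
            # Uses the captured local value, not self.batch_size.
            return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

        self.get_batches = get_batches


config = Config(batch_size=4)
config.setup()
config.batch_size = 2  # too late: the pipeline was already constructed

batches = config.get_batches(list(range(8)))
print([len(b) for b in batches])
```

The batches still have four items each, because the pipeline kept the value it saw during `setup()`. Freezing values this way is also what keeps a finished run traceable back to a single, known configuration.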
Hi @vfdev-5, I have a draft for the PR here.
I took the liberty of adding a requirements.txt for the sample project; it seemed that the only place to get the list of packages was the conda yaml file in the mlflow directory :)
What do you think?
@bmartinn looks good! I think I'll need to follow your readme to re-run the training and check how it works. There are some points to fix, but it seems OK.
About requirements.txt: in some sense, yes, maybe it would be better to create a common list of them. The problem was with Polyaxon: when we built the training docker image, the requirements file was not yet available or something (maybe that has changed now).
I was also thinking about a solution to set up a Trains server for ignite and make it available read-only, so users can see training logs online. I would probably need some support from you guys, as locally my docker-compose fails to bring up some docker images (they restart permanently). What would be the best communication channel for that: your slack, email, github?
@vfdev-5 just a quick FYI: for the smoothest experience, I suggest waiting for the next RC (should be available some time next week). We added some improvements that were also needed here, see #1056
> I was also thinking about a solution to set up a Trains server for ignite and make it available read-only, so users can see training logs online.
Sounds like a great idea. Slack is probably better suited for that, feel free to DM me on our slack channel :)
PR #1095 merged, closing this issue.
Reopen if something comes up.