I use pytorch lightning + wandb logger. I do not know how to extract history of training (training losses, validation losses...) from pytorch lightning or from the logger.
In the docs, the logger has a property (experiment) returning a wandb run object
https://pytorch-lightning.readthedocs.io/en/latest/loggers.html?highlight=wandb#weights-and-biases
this Run object looks like the run object described here
https://docs.wandb.com/library/reference/wandb_api#run
but it is missing some elements (no member called state) and some functionality. For example, the history member is not callable, so I cannot obtain information about the training history.
I don't understand. The logger.experiment is what wandb.init returns, the Run object. How are you trying to access these attributes?
I have something like this to declare the logger and the trainer (WandbLogger is imported from pytorch_lightning.loggers)
wandb_logger = WandbLogger(...)
trainer = pl.Trainer(
    ...,
    logger=wandb_logger,
)
Then, after training, I try to get information from the Run object, which is wandb_logger.experiment, but I fail to obtain what I want (history):
>>> wandb_logger.experiment
<wandb.sdk.wandb_run.Run object at 0x7f08576174f0>
>>> wandb_logger.experiment.history
<wandb.sdk.wandb_history.History object at 0x7f0857617580>
>>> wandb_logger.experiment.history()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'History' object is not callable
Alternatively, I can also access the same Run object by importing wandb and accessing wandb.run:
>>> import wandb
>>> wandb.run
<wandb.sdk.wandb_run.Run object at 0x7f08576174f0>
Please note that some public attributes are missing (created_at, history_keys, state...; compare with https://docs.wandb.com/library/reference/wandb_api#run):
>>> [m for m in dir(wandb_logger.experiment) if not m.startswith('_')]
['config', 'dir', 'entity', 'finish', 'get_url', 'history', 'id', 'join', 'log', 'log_artifact', 'name', 'notes', 'path', 'project_name', 'restore', 'resumed', 'save', 'start_time', 'starting_step', 'stderr_redirector', 'stdout_redirector', 'step', 'summary', 'tags', 'url', 'use_artifact', 'watch']
This is happening on multi gpu, yes?
No, I do not think so. I have a single GPU and I pass the gpus=1 option to my trainer.
Hello, I'm experiencing something similar, and I am using multi-GPU.
run = trainer.logger.experiment
print(f"Ending run: {str(run.id)}")
prints
Ending run: <bound method DummyExperiment.nop of <pytorch_lightning.loggers.base.DummyExperiment object at 0x7fb784f186d0>>
The full callback code is below
class WandbArtifactCallback(pl.Callback):

    def on_train_end(self, trainer, pl_module):
        run = trainer.logger.experiment
        print(f"Ending run: {str(run.id)}")
        artifact = wandb.Artifact(f"{str(run.id)}_model", type="model")
        for path, val_loss in trainer.checkpoint_callback.best_k_models.items():
            print(f"Adding artifact: {path}")
            artifact.add_file(path)
        run.log_artifact(artifact)

    def on_keyboard_interrupt(self, trainer, pl_module):
        self.on_train_end(trainer, pl_module)
For multi-GPU (@kyoungrok0517's case) I have an answer. In multi-GPU training, the logger only runs in the process attached to GPU 0. This is to avoid problems with file I/O, bottlenecks, and multiprocessing in general. When you access self.logger.experiment, we return the real object only in process 0; in all other processes you get a dummy experiment. This way your code does not break if you do things like self.logger.experiment.log_images(...): on processes with rank > 0 it simply becomes a no-op.
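This also explains why run.id printed a bound method above. As a minimal, hypothetical sketch (illustrative only, not the actual pytorch_lightning implementation), a dummy experiment can be built so that every attribute access resolves to a no-op method:

```python
class DummyExperiment:
    """Illustrative no-op stand-in for the real wandb run object."""

    def nop(self, *args, **kwargs):
        # accepts anything, does nothing
        pass

    def __getattr__(self, name):
        # called for any attribute not found normally, so
        # run.log, run.id, run.anything all resolve to nop
        return self.nop


run = DummyExperiment()
run.log({"loss": 0.1})  # silently does nothing, as on rank > 0
print(str(run.id))      # a bound nop method, matching the output above
```

Because attribute access returns the no-op method itself, str(run.id) yields something like "<bound method DummyExperiment.nop of ...>" instead of a run id.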
In your case, you want to create an artifact, so I suggest you do this on process 0 only, like this:
class WandbArtifactCallback(pl.Callback):

    def on_train_end(self, trainer, pl_module):
        if trainer.global_rank > 0:  # <------------------ add this
            return
        # ... your custom logger code

    def on_keyboard_interrupt(self, trainer, pl_module):
        self.on_train_end(trainer, pl_module)
For your case @fjhheras, not sure yet what's happening. I will try to reproduce.
Now I figured why the outcome was inconsistent. Thanks!
This isn't a pytorch lightning issue, it is just a quirk of the wandb api. There are two different run objects: wandb.wandb_run.Run is returned by wandb.init and handles the logging, while wandb.apis.public.Run is for reading data after the run is complete.
Here is a repro of your problem without pytorch lightning.
import wandb

api = wandb.Api()
run = wandb.init()
for i in range(4):
    run.log({"i": i})
run_path = run.path

print("type(run): ", type(run))
print("run.history: ", type(run.history))

read_access_run = api.run(run_path)
print("type(read_access_run): ", type(read_access_run))
print("type(read_access_run.history()): ", type(read_access_run.history()))
This outputs:
type(run): <class 'wandb.wandb_run.Run'>
run.history: <class 'wandb.history.History'>
type(read_access_run): <class 'wandb.apis.public.Run'>
type(read_access_run.history()): <class 'pandas.core.frame.DataFrame'>
All you need to do is take the path from the experiment on the logger and pass it to api.run, and that will create the run object you are looking for.
run_path = wandb_logger.experiment.path
read_access_run = api.run(run_path)
@Tim-Chard Thanks for the explanation. I don't follow completely but would like to confirm if we need to do anything in our wandb wrapper.
@kyoungrok0517 @fjhheras did this suggestion from Tim work for you?
@awaelchli Essentially there are two Run classes in the wandb api that look similar (they both have history, for example) but are for different things. The issue was caused by trying to use one in place of the other.
So, to confirm, no changes are required in the wrapper to address this issue.
@awaelchli Yes, @Tim-Chard's explanation makes sense and what he proposes works. Thank you!
Unfortunately, @Tim-Chard's proposal only works when I am online (wandb enabled). As it is purely a wandb issue now, I opened a question there (https://github.com/wandb/client/issues/1308).
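If the public API is unavailable (e.g. offline), one workaround is to keep your own record of logged metrics alongside the logger. This is a hedged, framework-light sketch; MetricHistory is a hypothetical helper, not part of wandb or pytorch_lightning:

```python
class MetricHistory:
    """Hypothetical helper that records logged metrics locally, so the
    training history is available even without the wandb public API."""

    def __init__(self):
        self.rows = []

    def log(self, metrics, step=None):
        # store a copy so later mutation of `metrics` cannot corrupt history
        row = dict(metrics)
        row["step"] = step if step is not None else len(self.rows)
        self.rows.append(row)


history = MetricHistory()
for step, loss in enumerate([0.9, 0.5, 0.3]):
    history.log({"train_loss": loss}, step=step)

print(history.rows[-1])  # {'train_loss': 0.3, 'step': 2}
```

You would call history.log(...) wherever you already call self.log(...) or logger.experiment.log(...), and read history.rows back after training.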