Following the example in the documentation:
>>> tune.run(my_trainable, name="my_exp", local_dir="~/tune_results")
>>> analysis = ExperimentAnalysis(
...     experiment_checkpoint_path="~/tune_results/my_exp/state.json")
This throws the warning "No `self.trials`. Drawing logdirs from checkpoint file. This may result in some information that is out of sync, as checkpointing is periodic." Afterwards, any attempt to use the ExperimentAnalysis methods results in:
----> 1 ea.get_best_logdir('accuracy')
~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_logdir(self, metric, mode, scope)
308 based on `mode`, and compare trials based on `mode=[min,max]`.
309 """
--> 310 best_trial = self.get_best_trial(metric, mode, scope)
311 return best_trial.logdir if best_trial else None
312
~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_trial(self, metric, mode, scope)
244 best_trial = None
245 best_metric_score = None
--> 246 for trial in self.trials:
247 if metric not in trial.metric_analysis:
248 continue
TypeError: 'NoneType' object is not iterable
ExperimentAnalysis only works with the object returned by tune.run (which passes the trials to it explicitly); when trying to analyze the logs of a past experiment, there is no way to populate self.trials.
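For reference, the trial records do appear to be serialized inside the experiment state file itself. Here is a minimal sketch to inspect it with the standard library (the "checkpoints" key is an assumption based on this Ray version and may not hold in others):

import json

# Path taken from the reproduction below; adjust to your own experiment.
state_path = "test/TrainMNIST/experiment_state-2020-07-11_16-49-24.json"
with open(state_path) as f:
    state = json.load(f)

print(state.keys())                        # top-level structure of the state file
print(len(state.get("checkpoints", [])))   # serialized trial records, if present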
Ray version and other system information (Python version, TensorFlow version, OS):
python 3.7.4, ray 0.9.0.dev0, ubuntu 18.04
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
It's the PyTorch trainable example with logging; here it is:
from __future__ import print_function
import argparse
import os

import torch
import torch.optim as optim

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.examples.mnist_pytorch import (train, test, get_data_loaders,
                                              ConvNet)

# Change these values if you want the training to run quicker or slower.
EPOCH_SIZE = 512
TEST_SIZE = 256

# Training settings
parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
parser.add_argument(
    "--use-gpu",
    action="store_true",
    default=False,
    help="enables CUDA training")
parser.add_argument(
    "--ray-address", type=str, help="The Redis address of the cluster.")
parser.add_argument(
    "--smoke-test", action="store_true", help="Finish quickly for testing")

# Below comments are for documentation purposes only.
# yapf: disable
# __trainable_example_begin__
class TrainMNIST(tune.Trainable):
    def setup(self, config):
        use_cuda = config.get("use_gpu") and torch.cuda.is_available()
        self.device = torch.device("cuda" if use_cuda else "cpu")
        self.train_loader, self.test_loader = get_data_loaders()
        self.model = ConvNet().to(self.device)
        self.optimizer = optim.SGD(
            self.model.parameters(),
            lr=config.get("lr", 0.01),
            momentum=config.get("momentum", 0.9))

    def step(self):
        train(
            self.model, self.optimizer, self.train_loader, device=self.device)
        acc = test(self.model, self.test_loader, self.device)
        return {"mean_accuracy": acc}

    def save_checkpoint(self, checkpoint_dir):
        checkpoint_path = os.path.join(checkpoint_dir, "model.pth")
        torch.save(self.model.state_dict(), checkpoint_path)
        return checkpoint_path

    def load_checkpoint(self, checkpoint_path):
        self.model.load_state_dict(torch.load(checkpoint_path))
# __trainable_example_end__
# yapf: enable

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init(address=args.ray_address, num_cpus=6 if args.smoke_test else None)
    sched = ASHAScheduler(metric="mean_accuracy")
    analysis = tune.run(
        TrainMNIST,
        local_dir='test',
        scheduler=sched,
        stop={
            "mean_accuracy": 0.95,
            "training_iteration": 3 if args.smoke_test else 20,
        },
        resources_per_trial={
            "cpu": 3,
            "gpu": int(args.use_gpu)
        },
        num_samples=1 if args.smoke_test else 20,
        checkpoint_at_end=True,
        checkpoint_freq=3,
        config={
            "args": args,
            "lr": tune.uniform(0.001, 0.1),
            "momentum": tune.uniform(0.1, 0.9),
        })
Then:
In [1]: from ray import tune
In [2]: e = tune.ExperimentAnalysis('test/TrainMNIST/experiment_state-2020-07-11_16-49-24.json')
No `self.trials`. Drawing logdirs from checkpoint file. This may result in some information that is out of sync, as checkpointing is periodic.
In [3]: e.get_best_config('mean_accuracy')
TypeError Traceback (most recent call last)
<ipython-input-3-a39abca74dac> in <module>
----> 1 e.get_best_config('mean_accuracy')
~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_config(self, metric, mode, scope)
286 based on `mode`, and compare trials based on `mode=[min,max]`.
287 """
--> 288 best_trial = self.get_best_trial(metric, mode, scope)
289 return best_trial.config if best_trial else None
290
~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_trial(self, metric, mode, scope)
244 best_trial = None
245 best_metric_score = None
--> 246 for trial in self.trials:
247 if metric not in trial.metric_analysis:
248 continue
TypeError: 'NoneType' object is not iterable
I think you'll need to use the Analysis class instead of the ExperimentAnalysis class - https://docs.ray.io/en/master/tune/api_docs/analysis.html#id1
can you try that out and let me know if that works?
analysis = tune.Analysis('test/TrainMNIST/')
Hey @richardliaw, using tune.Analysis does work, although it doesn't have all the get_best_* methods. Also, since the ExperimentAnalysis example is in the documentation, I was under the impression I was doing something wrong on my end. Does creating an ExperimentAnalysis from the logs not work at all, then?
No, it currently does not - let me push a fix in this upcoming week. Thanks for opening this issue!
By the way, just wondering - what are you trying to do exactly? Is there a reason you want to call this after the tune.run has finished?
I wanted to be able to get the same object returned by tune.run easily, mainly for past experiments where I want to retrieve the best config/performance without re-running anything.
Currently I'm using Analysis and working off the dataframe(), but it's a bit ugly, so I was just wondering (this isn't a huge issue, I just figured people might be confused by the doc!).
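For context, the dataframe() workaround looks roughly like this (the "logdir" column and get_all_configs() being keyed by logdir are assumptions that held for my Ray version and may differ in others):

from ray import tune

analysis = tune.Analysis("test/TrainMNIST/")
df = analysis.dataframe()

# Row with the highest reported mean_accuracy across all trials.
best_row = df.loc[df["mean_accuracy"].idxmax()]

# Map the winning trial's logdir back to its hyperparameter config.
best_config = analysis.get_all_configs()[best_row["logdir"]]
print(best_config, best_row["mean_accuracy"])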
got it; thanks! I've filed an issue to make this UX better. Please let me know if you spot any other issues or have any other feature suggestions!
I hope this can get fixed soon; it seems we still have no way to read the trials back from the logs, even though the trials do appear to be saved there. When the experiment is submitted as a job to a managed cluster, there is no way to get the analysis object immediately after calling tune.run(). It's hard to believe such an oversight occurred; maybe the developers are using a different workflow?
It seems I found a workaround:
analysis = tune.run(my_trainable, local_dir="~/tune_results", resume=True)
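In full, the workaround looks something like this for the reproduction above (an assumption: the same trainable and local_dir must be passed again so that resume=True can restore the finished trials, after which the returned analysis has its trials populated and no new training is launched):

from ray import tune

# TrainMNIST is the trainable class from the reproduction script above.
# resume=True restores the trials from experiment_state-*.json under
# <local_dir>/<experiment name> instead of starting a new run.
analysis = tune.run(
    TrainMNIST,
    local_dir="test",
    resume=True)

print(analysis.get_best_config("mean_accuracy", mode="max"))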