Ray: [tune] Ray Tune Analysis doesn't work like in the documentation

Created on 12 Jul 2020 · 9 Comments · Source: ray-project/ray

What is the problem?

Following the example in the documentation:

>>> tune.run(my_trainable, name="my_exp", local_dir="~/tune_results")
>>> analysis = ExperimentAnalysis(
...     experiment_checkpoint_path="~/tune_results/my_exp/state.json")

This emits the warning "No `self.trials`. Drawing logdirs from checkpoint file. This may result in some information that is out of sync, as checkpointing is periodic." After that, any attempt to use the ExperimentAnalysis methods fails with:

----> 1 ea.get_best_logdir('accuracy')

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_logdir(self, metric, mode, scope)
    308                 based on `mode`, and compare trials based on `mode=[min,max]`.
    309         """
--> 310         best_trial = self.get_best_trial(metric, mode, scope)
    311         return best_trial.logdir if best_trial else None
    312 

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_trial(self, metric, mode, scope)
    244         best_trial = None
    245         best_metric_score = None
--> 246         for trial in self.trials:
    247             if metric not in trial.metric_analysis:
    248                 continue

TypeError: 'NoneType' object is not iterable
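
The failure mode itself can be shown without Ray. The sketch below uses a hypothetical `MiniAnalysis` class (a stand-in written for illustration, not Ray's code) whose `trials` attribute stays `None` unless the caller passes trials in. The guarded method shows one way such a method could fail with a clearer message; it is an assumption about a possible fix, not what Ray does.

```python
class MiniAnalysis:
    """Hypothetical stand-in for illustration; not Ray's ExperimentAnalysis."""

    def __init__(self, trials=None):
        # Stays None when nothing passes the trials in explicitly.
        self.trials = trials

    def get_best_trial_unguarded(self, metric):
        best = None
        # Iterating over None raises:
        # TypeError: 'NoneType' object is not iterable
        for trial in self.trials:
            if metric not in trial:
                continue
            if best is None or trial[metric] > best[metric]:
                best = trial
        return best

    def get_best_trial(self, metric):
        # Guarded variant: fail early with an explicit message.
        if self.trials is None:
            raise ValueError(
                "No trials loaded; this object was built without trials.")
        return self.get_best_trial_unguarded(metric)
```

Here each trial is a plain dict of metric values, which is a simplification of the real trial objects.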

These methods only work on the ExperimentAnalysis object returned by tune.run, which passes the trials in explicitly. When the object is built from a checkpoint file to analyze the logs of an earlier experiment, there is no way to populate self.trials.

Ray version and other system information (Python version, TensorFlow version, OS):
python 3.7.4, ray 0.9.0.dev0, ubuntu 18.04

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

It's the PyTorch trainable example with logging:

from __future__ import print_function

import argparse
import os
import torch
import torch.optim as optim

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.examples.mnist_pytorch import (train, test, get_data_loaders,
                                             ConvNet)

# Change these values if you want the training to run quicker or slower.
EPOCH_SIZE = 512
TEST_SIZE = 256

# Training settings
parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
parser.add_argument(
    "--use-gpu",
    action="store_true",
    default=False,
    help="enables CUDA training")
parser.add_argument(
    "--ray-address", type=str, help="The Redis address of the cluster.")
parser.add_argument(
    "--smoke-test", action="store_true", help="Finish quickly for testing")


# Below comments are for documentation purposes only.
# yapf: disable
# __trainable_example_begin__
class TrainMNIST(tune.Trainable):
    def setup(self, config):
        use_cuda = config.get("use_gpu") and torch.cuda.is_available()
        self.device = torch.device("cuda" if use_cuda else "cpu")
        self.train_loader, self.test_loader = get_data_loaders()
        self.model = ConvNet().to(self.device)
        self.optimizer = optim.SGD(
            self.model.parameters(),
            lr=config.get("lr", 0.01),
            momentum=config.get("momentum", 0.9))

    def step(self):
        train(
            self.model, self.optimizer, self.train_loader, device=self.device)
        acc = test(self.model, self.test_loader, self.device)
        return {"mean_accuracy": acc}

    def save_checkpoint(self, checkpoint_dir):
        checkpoint_path = os.path.join(checkpoint_dir, "model.pth")
        torch.save(self.model.state_dict(), checkpoint_path)
        return checkpoint_path

    def load_checkpoint(self, checkpoint_path):
        self.model.load_state_dict(torch.load(checkpoint_path))


# __trainable_example_end__
# yapf: enable

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init(address=args.ray_address, num_cpus=6 if args.smoke_test else None)
    sched = ASHAScheduler(metric="mean_accuracy")
    analysis = tune.run(
        TrainMNIST,
        local_dir='test',
        scheduler=sched,
        stop={
            "mean_accuracy": 0.95,
            "training_iteration": 3 if args.smoke_test else 20,
        },
        resources_per_trial={
            "cpu": 3,
            "gpu": int(args.use_gpu)
        },
        num_samples=1 if args.smoke_test else 20,
        checkpoint_at_end=True,
        checkpoint_freq=3,
        config={
            # setup() reads config["use_gpu"], so pass the flag itself.
            "use_gpu": args.use_gpu,
            "lr": tune.uniform(0.001, 0.1),
            "momentum": tune.uniform(0.1, 0.9),
        })

Then:

In [1]: from ray import tune                                                                                          

In [2]: e = tune.ExperimentAnalysis('test/TrainMNIST/experiment_state-2020-07-11_16-49-24.json')                      
No `self.trials`. Drawing logdirs from checkpoint file. This may result in some information that is out of sync, as checkpointing is periodic.

In [3]: e.get_best_config('mean_accuracy') 
TypeError                                 Traceback (most recent call last)
<ipython-input-3-a39abca74dac> in <module>
----> 1 e.get_best_config('mean_accuracy')

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_config(self, metric, mode, scope)
    286                 based on `mode`, and compare trials based on `mode=[min,max]`.
    287         """
--> 288         best_trial = self.get_best_trial(metric, mode, scope)
    289         return best_trial.config if best_trial else None
    290 

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_trial(self, metric, mode, scope)
    244         best_trial = None
    245         best_metric_score = None
--> 246         for trial in self.trials:
    247             if metric not in trial.metric_analysis:
    248                 continue

TypeError: 'NoneType' object is not iterable
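
Until `self.trials` can be populated from a checkpoint, the same information can probably be recovered from the trial directories directly. The sketch below is a workaround under stated assumptions, not part of Ray's API: it assumes each trial directory under the experiment directory (here `test/TrainMNIST`) contains Tune's default `result.json` (one JSON object per line) and `params.json`. The function name `best_config_from_logdirs` is hypothetical.

```python
import json
import os


def best_config_from_logdirs(experiment_dir, metric, mode="max"):
    """Return (best_value, config, logdir), or None if no trial reports metric.

    Assumes each trial directory holds result.json (one JSON object per
    line) and params.json. Compares the best value seen at any iteration.
    """
    pick = max if mode == "max" else min
    best = None
    for root, _dirs, files in os.walk(os.path.expanduser(experiment_dir)):
        if "result.json" not in files or "params.json" not in files:
            continue  # not a trial directory
        values = []
        with open(os.path.join(root, "result.json")) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                row = json.loads(line)
                if metric in row:
                    values.append(row[metric])
        if not values:
            continue  # this trial never reported the metric
        score = pick(values)
        if best is not None:
            better = score > best[0] if mode == "max" else score < best[0]
            if not better:
                continue
        with open(os.path.join(root, "params.json")) as f:
            config = json.load(f)
        best = (score, config, root)
    return best
```

For the run above this would be called as `best_config_from_logdirs("test/TrainMNIST", "mean_accuracy")`. Note that it compares each trial's best value over all iterations, which may differ from a comparison based on the last reported result.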

If we cannot run your script, we cannot fix your issue.

[x] I have verified my script runs in a clean environment and reproduces the issue.
[ ] I have verified the issue also occurs with the latest wheels.

P3 bug tune


ksanjeevan
