Ray: [tune] Ray Tune Analysis doesn't work like in the documentation

Created on 12 Jul 2020 · 9 Comments · Source: ray-project/ray

What is the problem?

Following the example in the documentation:

>>> tune.run(my_trainable, name="my_exp", local_dir="~/tune_results")
>>> analysis = ExperimentAnalysis(
>>>     experiment_checkpoint_path="~/tune_results/my_exp/state.json")

This throws the warning "No `self.trials`. Drawing logdirs from checkpoint file. This may result in some information that is out of sync, as checkpointing is periodic." After that, any attempt to use the ExperimentAnalysis methods results in:

----> 1 ea.get_best_logdir('accuracy')

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_logdir(self, metric, mode, scope)
    308                 based on `mode`, and compare trials based on `mode=[min,max]`.
    309         """
--> 310         best_trial = self.get_best_trial(metric, mode, scope)
    311         return best_trial.logdir if best_trial else None
    312 

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_trial(self, metric, mode, scope)
    244         best_trial = None
    245         best_metric_score = None
--> 246         for trial in self.trials:
    247             if metric not in trial.metric_analysis:
    248                 continue

TypeError: 'NoneType' object is not iterable

These methods only work on the object returned by tune.run, which passes the trials in explicitly. When trying to analyze the logs of a past experiment, there is no way to populate self.trials.
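
The failure mode described above can be reproduced without Ray at all. The following is only a minimal sketch: `AnalysisSketch` and `try_best` are hypothetical stand-ins written for illustration, not Ray's actual classes, and the trials are plain dicts rather than real trial objects. It shows why a `trials` attribute left as `None` breaks every method that loops over it.

```python
class AnalysisSketch:
    """Hypothetical stand-in for ExperimentAnalysis; not Ray's actual class."""

    def __init__(self, trials=None):
        # Assumption taken from the report: tune.run passes trials in,
        # while loading from a state file leaves this attribute as None.
        self.trials = trials

    def get_best_trial(self, metric):
        best_trial, best_score = None, None
        for trial in self.trials:  # raises TypeError when self.trials is None
            if metric not in trial:
                continue
            if best_score is None or trial[metric] > best_score:
                best_trial, best_score = trial, trial[metric]
        return best_trial


def try_best(analysis, metric):
    """Return the best trial, or the error text if the lookup fails."""
    try:
        return analysis.get_best_trial(metric)
    except TypeError as err:
        return "TypeError: {}".format(err)


# Made-up trial results, for illustration only.
from_run = AnalysisSketch(trials=[{"mean_accuracy": 0.91}, {"mean_accuracy": 0.95}])
from_file = AnalysisSketch()  # no trials supplied, as when loading from disk

print(try_best(from_run, "mean_accuracy"))
print(try_best(from_file, "mean_accuracy"))
```

The second call hits the same `for trial in self.trials` loop shown in the traceback, so it fails in the same way.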

Ray version and other system information (Python version, TensorFlow version, OS):
python 3.7.4, ray 0.9.0.dev0, ubuntu 18.04

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

It's the PyTorch trainable example with logging; here it is:

from __future__ import print_function

import argparse
import os
import torch
import torch.optim as optim

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.examples.mnist_pytorch import (train, test, get_data_loaders,
                                             ConvNet)

# Change these values if you want the training to run quicker or slower.
EPOCH_SIZE = 512
TEST_SIZE = 256

# Training settings
parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
parser.add_argument(
    "--use-gpu",
    action="store_true",
    default=False,
    help="enables CUDA training")
parser.add_argument(
    "--ray-address", type=str, help="The Redis address of the cluster.")
parser.add_argument(
    "--smoke-test", action="store_true", help="Finish quickly for testing")


# Below comments are for documentation purposes only.
# yapf: disable
# __trainable_example_begin__
class TrainMNIST(tune.Trainable):
    def setup(self, config):
        use_cuda = config.get("use_gpu") and torch.cuda.is_available()
        self.device = torch.device("cuda" if use_cuda else "cpu")
        self.train_loader, self.test_loader = get_data_loaders()
        self.model = ConvNet().to(self.device)
        self.optimizer = optim.SGD(
            self.model.parameters(),
            lr=config.get("lr", 0.01),
            momentum=config.get("momentum", 0.9))

    def step(self):
        train(
            self.model, self.optimizer, self.train_loader, device=self.device)
        acc = test(self.model, self.test_loader, self.device)
        return {"mean_accuracy": acc}

    def save_checkpoint(self, checkpoint_dir):
        checkpoint_path = os.path.join(checkpoint_dir, "model.pth")
        torch.save(self.model.state_dict(), checkpoint_path)
        return checkpoint_path

    def load_checkpoint(self, checkpoint_path):
        self.model.load_state_dict(torch.load(checkpoint_path))


# __trainable_example_end__
# yapf: enable

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init(address=args.ray_address, num_cpus=6 if args.smoke_test else None)
    sched = ASHAScheduler(metric="mean_accuracy")
    analysis = tune.run(
        TrainMNIST,
        local_dir='test',
        scheduler=sched,
        stop={
            "mean_accuracy": 0.95,
            "training_iteration": 3 if args.smoke_test else 20,
        },
        resources_per_trial={
            "cpu": 3,
            "gpu": int(args.use_gpu)
        },
        num_samples=1 if args.smoke_test else 20,
        checkpoint_at_end=True,
        checkpoint_freq=3,
        config={
            "args": args,
            "lr": tune.uniform(0.001, 0.1),
            "momentum": tune.uniform(0.1, 0.9),
        })

Then:

In [1]: from ray import tune                                                                                          

In [2]: e = tune.ExperimentAnalysis('test/TrainMNIST/experiment_state-2020-07-11_16-49-24.json')                      
No `self.trials`. Drawing logdirs from checkpoint file. This may result in some information that is out of sync, as checkpointing is periodic.

In [3]: e.get_best_config('mean_accuracy') 
TypeError                                 Traceback (most recent call last)
<ipython-input-3-a39abca74dac> in <module>
----> 1 e.get_best_config('mean_accuracy')

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_config(self, metric, mode, scope)
    286                 based on `mode`, and compare trials based on `mode=[min,max]`.
    287         """
--> 288         best_trial = self.get_best_trial(metric, mode, scope)
    289         return best_trial.config if best_trial else None
    290 

~/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py in get_best_trial(self, metric, mode, scope)
    244         best_trial = None
    245         best_metric_score = None
--> 246         for trial in self.trials:
    247             if metric not in trial.metric_analysis:
    248                 continue

TypeError: 'NoneType' object is not iterable

If we cannot run your script, we cannot fix your issue.

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [ ] I have verified the issue also occurs with the latest wheels.
P3 bug tune

All 9 comments

I think you'll need to use the Analysis class instead of the ExperimentAnalysis class - https://docs.ray.io/en/master/tune/api_docs/analysis.html#id1

can you try that out and let me know if that works?

analysis = tune.Analysis('test/TrainMNIST/')

Hey @richardliaw, using tune.Analysis does work, although it doesn't have all the get_best_ type methods. Also since it's in the documentation I was under the impression I was doing something wrong. Does creating an ExperimentAnalysis from the logs not work at all then?

No, it currently does not - let me push a fix in this upcoming week. Thanks for opening this issue!

By the way, just wondering - what are you trying to do exactly? Is there a reason you want to call this after the tune.run has finished?

I wanted to be able to get the same object returned by tune.run easily, mainly for past experiments where I want to retrieve the best config/performance without re-running anything.

Currently I'm using Analysis and then working off the dataframe(), but it's a bit ugly, so I was just wondering (this isn't a huge issue, I just figured people might be confused by the docs!).
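
For readers in the same situation, the "work off the dataframe" approach might look roughly like the sketch below. It is not taken from the thread: `best_row` and `config_of` are hypothetical helpers, the rows are made-up values, and the column layout (one row per trial, hyperparameters under a `config/` prefix, plus a `logdir` column) is an assumption about what `tune.Analysis(...).dataframe()` returns. In practice the rows might come from something like `analysis.dataframe().to_dict("records")`.

```python
def best_row(rows, metric, mode="max"):
    """Pick the row with the best value of `metric` (hypothetical helper)."""
    candidates = [r for r in rows if r.get(metric) is not None]
    if not candidates:
        return None
    pick = max if mode == "max" else min
    return pick(candidates, key=lambda r: r[metric])


def config_of(row, prefix="config/"):
    """Collect the hyperparameters stored under the assumed `config/` prefix."""
    return {k[len(prefix):]: v for k, v in row.items() if k.startswith(prefix)}


# Made-up rows, one per trial; the column names are an assumption.
rows = [
    {"mean_accuracy": 0.91, "config/lr": 0.01, "config/momentum": 0.9, "logdir": "trial_a"},
    {"mean_accuracy": 0.95, "config/lr": 0.05, "config/momentum": 0.5, "logdir": "trial_b"},
    {"mean_accuracy": 0.88, "config/lr": 0.1, "config/momentum": 0.1, "logdir": "trial_c"},
]

best = best_row(rows, "mean_accuracy")
print(best["logdir"], config_of(best))
```

This only compares whichever result each row holds for a trial, so it is a rough substitute for the get_best_ methods rather than an equivalent.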

got it; thanks! I've filed an issue to make this UX better. Please let me know if you spot any other issues or have any other feature suggestions!

Hope this can get fixed soon, as it seems we still have no way to read the trials back from the log, even though the trials appear to be saved there. When the experiment is submitted as a job to a managed cluster, there is no way to get the analysis object immediately after calling tune.run(). It is unbelievable that such an oversight occurred - maybe the developers are using a different workflow?

It seems I found a workaround:
analysis = tune.run(my_trainable, local_dir="~/tune_results", resume=True)
