Ray: [tune] AWS Spot Instance Automation

Created on 27 Jul 2020 · 5 comments · Source: ray-project/ray

I'm using the following python tune script:

hpo.py

# Import libraries
import time
from lightgbm import LGBMClassifier
import pandas as pd
import ray
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tune_sklearn import TuneSearchCV

# Start timer
start = time.time()

# Initialize Ray
ray.init(address='auto')

# Load breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize estimator
model = LGBMClassifier()

# Initialize parameter distributions
param_dists = {
    'boosting_type': ['gbdt'],
    'colsample_bytree': (0.8, 0.9, 'log-uniform'),
    'reg_alpha': (1.1, 1.3),
    'reg_lambda': (1.1, 1.3),
    'min_split_gain': (0.3, 0.4),
    'subsample': (0.7, 0.9),
    'subsample_freq': (20, 21)
}

# Initialize tuner
tuner = TuneSearchCV(
    model,
    param_dists,
    n_iter=20,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2,
    max_iters=10,
    search_optimization='bayesian',
    use_gpu=True,
)

# Tune hyperparameters
tuner.fit(X_train, y_train)
print('Best Parameters :', tuner.best_params_)

# Get cross-validated results
df_cv = pd.DataFrame(tuner.cv_results_)

# Predict using best hyperparameters
y_pred = tuner.predict(X_test)
print('F1 Score:', f1_score(y_test, y_pred, average='weighted'))

# Get elapsed time
end = time.time()
print('Elapsed Time :', (end - start))

# Shutdown Ray
ray.shutdown()

and the following configuration file:

tune-default-hpo.yaml

cluster_name: tune-default
provider: {type: aws, region: us-east-2}
auth: {ssh_user: ubuntu}
min_workers: 3
max_workers: 3

head_node:
    InstanceType: c5.xlarge
    ImageId: ami-08bf49c7b3a0c761e

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        SpotOptions:
            MaxPrice: "1"  # Max Hourly Price

# Provider-specific config for worker nodes, e.g. instance type.
worker_nodes:
    InstanceType: m5.large
    ImageId: ami-08bf49c7b3a0c761e

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        SpotOptions:
            MaxPrice: "1"  # Max Hourly Price

setup_commands: # Set up each node.
    - pip install lightgbm ray scikit-optimize torch torchvision tabulate tensorboard tune_sklearn

Command: ray submit tune-default-hpo.yaml hpo.py --start --stop

Questions:

  1. Ray is not terminating the spot instances if the connection is lost. Is there a way to configure ray to terminate them automatically?
  2. Also, is there a Python API to deploy the instances instead of running the command in the terminal?
  3. How do I configure Ray to deploy on a single spot instance and not a cluster?
  4. How can I set the storage size for the spot instances?

All 5 comments

Thanks for making this issue @rohan-gt!

Ray is not terminating the spot instances if the connection is lost. Is there a way to configure ray to terminate them automatically?

What do you mean by "the connection is lost"? Maybe ray submit --stop?

Also is there a python API to deploy the instances instead of running the command on the terminal?

Unfortunately, it's not public-facing.
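Since the cluster launcher has no public Python API, one workaround is to drive the same CLI commands from Python with subprocess. A minimal sketch, where build_submit_cmd is a hypothetical helper (not part of Ray) that mirrors the command used in this thread:

```python
# Sketch: wrap the `ray submit` CLI from Python, since there is no
# public Python API for the cluster launcher.
import subprocess

def build_submit_cmd(config_file, script, start=True, stop=True, tmux=False):
    """Construct the argv list for `ray submit` (hypothetical helper)."""
    cmd = ["ray", "submit", config_file, script]
    if start:
        cmd.append("--start")  # start the cluster if it isn't running
    if stop:
        cmd.append("--stop")   # tear the cluster down when the job finishes
    if tmux:
        cmd.append("--tmux")   # run the job inside tmux on the head node
    return cmd

cmd = build_submit_cmd("tune-default-hpo.yaml", "hpo.py", tmux=True)
print(" ".join(cmd))
# -> ray submit tune-default-hpo.yaml hpo.py --start --stop --tmux
# subprocess.run(cmd, check=True)  # uncomment to actually launch
```

This just shells out to the same commands you would type; it does not make the launcher internals any more stable or supported.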

How do I configure Ray to deploy on a single spot instance and not a cluster?

min_workers: 0
max_workers: 0
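Putting that together with the config from the question, a head-only (single spot instance) cluster file might look like this sketch (same region, AMI, and instance type as above, assumed unchanged):

```yaml
cluster_name: tune-single
provider: {type: aws, region: us-east-2}
auth: {ssh_user: ubuntu}
min_workers: 0
max_workers: 0

head_node:
    InstanceType: c5.xlarge
    ImageId: ami-08bf49c7b3a0c761e
    InstanceMarketOptions:
        MarketType: spot
        SpotOptions:
            MaxPrice: "1"  # Max Hourly Price
```

With no workers, everything runs on the head node, so size its instance type accordingly.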

How can I set the storage size for the spot instances?

worker_nodes:
    InstanceType: m4.16xlarge
    ImageId: ami-0def3275  # Default Ubuntu 16.04 AMI.

    # Set primary volume to 250 GiB
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 250
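Presumably the same BlockDeviceMappings section also works under head_node if you want a larger volume there too; a sketch based on the worker example above (an assumption, not verified against the head-node schema):

```yaml
head_node:
    InstanceType: c5.xlarge
    ImageId: ami-08bf49c7b3a0c761e

    # Set the head node's primary volume to 250 GiB as well
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 250
```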

@richardliaw I tried deploying the cluster from my local PC using the following command:
ray submit tune-default-hpo.yaml hpo.py --start --stop

But if the submitted Python script errors out, or I close the terminal mid-execution, or my internet connection is lost, the spot instance cluster is not shut down.

I later tried running ray submit --stop, which gives the error Error: Missing argument 'CLUSTER_CONFIG_FILE'., while ray submit tune-default-hpo.yaml hpo.py --stop errors out with Command 'ray' not found, did you mean: because I had closed the terminal before Ray was installed on the head node.

Is there a way to:

  1. Automatically reconnect and finish execution?
  2. Teardown the cluster if the connection was closed mid-execution and the head node is idle beyond a set time?

Sorry for the slow reply:

ray submit tune-default-hpo.yaml hpo.py --start --tmux --stop

tears down the cluster automatically after the job is finished AND does not require keeping the internet connection alive on your laptop.

Does that help?

Hmm, this still requires the ray package to be present on the head node before it can shut down, right? When I closed the terminal midway through the pip installation of ray, it didn't auto-shutdown. But if that's how it works, that's good enough.

Great point; yeah, to clarify, this requires Ray to have started on the head node (so ray up and ray start need to have finished).
