Ax: Slow GPEI candidate generation on Cloud-CPU

Created on 25 Jan 2021 · 12 Comments · Source: facebook/Ax

Ax is a good optimization library.
I am new to the Ax platform, but I recently noticed that generating parameters with the GPEI model takes much longer than it used to (e.g. experiment.new_trial(generator_run=gpei.gen(1))).
I don't know what the problem is. Any ideas?
Thank you!

question

sczhang870330

All 12 comments

@sczhang870330, thank you, glad you like Ax! How much slower is "much slower"? GPEI actually fits a surrogate model and optimizes an acquisition function over it, so it's expected that it would be slower than Sobol, which does not involve any ML. Without knowing the degree of the slowdown you are referring to, it's difficult to say whether it's just the expected slowdown because of modeling or more.

lena-kashtelyan on 25 Jan 2021
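
To put a number on "much slower", a small timing helper can be wrapped around the generation call. This is a minimal sketch using only the standard library; `time_call` is a hypothetical helper name, not part of Ax.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Run fn() several times and summarize the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

With the objects from the original post, something like `time_call(lambda: gpei.gen(1))` would report how long one GPEI candidate takes; the same helper around a Sobol `gen(1)` call gives a baseline to compare against.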

Thank you for your reply.
I understand that the GPEI strategy requires fitting a surrogate model, so it will be slower than the Sobol strategy, but recently it has been unusually slow. Using GPEI to generate one parameter combination used to take only about 10 seconds, but now it can take several minutes. Has anyone else run into the same problem? Thank you.

sczhang870330 on 26 Jan 2021

@sczhang870330, can you provide a minimal repro of your code, or at least some more details about the optimization problem you're working on?

ldworkin on 26 Jan 2021

I am facing the same problem: a simple example runs slower on a faster CPU. My benchmark code:

import numpy as np
from ax import (
    ComparisonOp,
    ParameterType, 
    RangeParameter,
    SearchSpace, 
    SimpleExperiment, 
    OutcomeConstraint, 
)
from ax.metrics.l2norm import L2NormMetric
from ax.modelbridge.registry import Models
import time

def evaluation_function(parameterization, weight=None):
    x = np.array([parameterization.get(f"x{i}") for i in range(1)])
    print("evaluation_function: ", x)
    return {"evaluation_function": (x**2, 0.0), "l2norm": (np.sqrt((x ** 2).sum()), 0.0)}

search_space = SearchSpace(
    parameters=[
        RangeParameter(
            name=f"x{i}", parameter_type=ParameterType.FLOAT, lower=0.0, upper=1.0
        )
        for i in range(1)
    ]
)

exp = SimpleExperiment(
    name="test",
    search_space=search_space,
    evaluation_function=evaluation_function,
    objective_name="evaluation_function",
    minimize=True,
    outcome_constraints=[
        OutcomeConstraint(
            metric=L2NormMetric(
                name="l2norm", param_names=[f"x{i}" for i in range(1)], noise_sd=0.2
            ),
            op=ComparisonOp.LEQ,
            bound=1.25,
            relative=False,
        )
    ],
)

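# Seed the experiment with 5 quasi-random Sobol trials.
sobol = Models.SOBOL(exp.search_space)
for i in range(5):
    exp.new_trial(generator_run=sobol.gen(1))
data = exp.eval()

# Fit the GP surrogate, then time the generation of a single candidate.
gpei = Models.BOTORCH(experiment=exp, data=data)
print("Fit time:", gpei.fit_time)
t = time.time()
run = gpei.gen(1)
print("Gen time:", time.time() - t)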

Results
My laptop:

4 core Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz:
Fit time: 0.12
Gen time: 0.30

Cloud-CPU:

0.5 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Fit time: 6.72
Gen time: 55.05

3 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Fit time: 25.64
Gen time: 49.90

6 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Fit time: 12.38
Gen time: 21.08

I can't figure out the cause of such a difference. Maybe something is off with the cloud machine, but a difference of roughly 100x is very strange. A simple (single-threaded) CPU benchmark doesn't show such a difference:

4 core Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz:
Average: 3.578s

0.5 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Average: 8.854s

3 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Average: 2.997s

Any suggestions on how to debug it?

valentin7121 on 5 Feb 2021
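
One possibility that could be ruled out here, offered purely as an assumption and not as a diagnosis from this thread: on a fractional-core cloud instance, the number of CPUs the OS reports can be much larger than the CPU time the process is actually allowed to use, and multi-threaded numeric code may then oversubscribe the quota. The sketch below uses only the standard library to compare the two on Linux; `visible_cpus` and `cgroup_cpu_quota` are hypothetical helper names, and the cgroup file paths are the usual defaults, which may differ on a given image.

```python
import os


def visible_cpus():
    """Return (logical_cpus, usable_cpus) for the current process."""
    logical = os.cpu_count() or 1
    # sched_getaffinity is Linux-only; fall back to the logical count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = logical
    return logical, usable


def cgroup_cpu_quota():
    """Return the CPU quota in cores, or None if unlimited or not readable."""
    # cgroup v2: a single file containing "<quota> <period>" or "max <period>".
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        return None if quota == "max" else int(quota) / int(period)
    except (OSError, ValueError):
        pass
    # cgroup v1: separate quota and period files; a quota of -1 means unlimited.
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0 and period > 0:
            return quota / period
    except (OSError, ValueError):
        pass
    return None


logical, usable = visible_cpus()
print("logical CPUs:", logical, "| usable CPUs:", usable, "| cgroup quota:", cgroup_cpu_quota())
```

If the usable count or quota turns out to be far below the logical count, capping the thread count through the `OMP_NUM_THREADS` environment variable or `torch.set_num_threads` may be worth trying before rerunning the benchmark.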

cc @Balandat, were there any underlying changes in PyTorch recently that might have caused this? From the OP, it sounds like the slowdown only started recently.

ldworkin on 5 Feb 2021

Not that I am aware of. Our implementations heavily exploit vectorized computations via Intel's MKL, so if that is not available/installed correctly then you'll see significant slowdowns. @valentin7121 can you provide the MKL build setup of both your laptop and the cloud setup?

Balandat on 5 Feb 2021

@Balandat I did not install MKL manually on either the laptop or the cloud machine. I only installed ax-platform with its dependencies:
pip3 install ax-platform
How can I check MKL setup?
(Laptop OS: Ubuntu 20.04.2 LTS; cloud OS: Ubuntu 18.04.3 LTS)

valentin7121 on 5 Feb 2021

On Ubuntu 18.04.3 LTS I get the following error during installation:

ERROR: torchvision 0.4.1+cu100 has requirement torch==1.3.0, but you'll have torch 1.7.1 which is incompatible.
Installing collected packages: torch, gpytorch, botorch, retrying, plotly, typeguard, ax-platform
  Attempting uninstall: torch
    Found existing installation: torch 1.3.0+cu100
    Uninstalling torch-1.3.0+cu100:
      Successfully uninstalled torch-1.3.0+cu100
Successfully installed ax-platform-0.1.19 botorch-0.3.3 gpytorch-1.3.1 plotly-4.14.3 retrying-1.3.3 torch-1.7.1 typeguard-2.10.0

Could this be the reason for the slowdown?

valentin7121 on 5 Feb 2021

cc @Balandat

lena-kashtelyan on 12 Feb 2021

"Could this be the reason for the slowdown?"

I doubt it. This just installs a much newer PyTorch version, so that should be OK.

Can you paste the output of the following?

import torch
print(torch.__config__.show())

Balandat on 14 Feb 2021

The PyTorch config is the same on both the laptop and the cloud machine:

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

After rebuilding the cloud OS image, the problem seems to be solved, even though the PyTorch configuration hasn't changed. The 3-core Cloud-CPU now shows the following in the same benchmark:

Fit time: 0.09515857696533203
Gen time: 0.5103681087493896

Thank you!

valentin7121 on 15 Feb 2021
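
Comparing two long config dumps by eye is error-prone, so a small script can do it. The sketch below parses the "Build settings" line of a `torch.__config__.show()` dump into key/value pairs and reports the keys that differ between two machines. `parse_build_settings` and `diff_settings` are hypothetical helper names, and the two sample strings are shortened, made-up inputs for illustration, not output from this thread.

```python
def parse_build_settings(config_text):
    """Extract KEY=VALUE pairs from the 'Build settings' line of a config dump."""
    settings = {}
    for line in config_text.splitlines():
        # Drop indentation and the leading "- " bullet.
        line = line.strip().lstrip("- ").strip()
        if not line.startswith("Build settings:"):
            continue
        body = line[len("Build settings:"):]
        for item in body.split(","):
            # Split on the first "=" only, since flag values may contain "=".
            key, sep, value = item.partition("=")
            if sep:
                settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Hypothetical, shortened dumps used only to illustrate the comparison.
machine_a = "PyTorch built with:\n  - GCC 7.3\n  - Build settings: BLAS=MKL, USE_MKL=ON, USE_OPENMP=ON,"
machine_b = "PyTorch built with:\n  - GCC 7.3\n  - Build settings: BLAS=OpenBLAS, USE_MKL=OFF, USE_OPENMP=ON,"

print(diff_settings(parse_build_settings(machine_a), parse_build_settings(machine_b)))
```

In practice the two inputs would be the saved text of `torch.__config__.show()` from each machine. An empty result means the build settings match, which is consistent with what was reported above. Note that this only compares build-time settings, not runtime state such as the number of threads in use.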

Interesting. Might have been a one-off issue with the cloud machine. Let us know if this re-occurs.

Balandat on 15 Feb 2021
