Ax: Slow GPEI candidate generation on Cloud-CPU

Created on 25 Jan 2021 · 12 Comments · Source: facebook/Ax

Ax is a good optimization library.
I am new to the Ax platform, and I recently noticed that generating parameters with the GPEI model (e.g. experiment.new_trial(generator_run=gpei.gen(1))) takes much longer than it used to.
What could be causing this?
Thank you!

question

All 12 comments

@sczhang870330, thank you, glad you like Ax! How much slower is "much slower"? GPEI actually fits a surrogate model and optimizes an acquisition function over it, so it's expected that it would be slower than Sobol, which does not involve any ML. Without knowing the degree of the slowdown you are referring to, it's difficult to say whether it's just the expected slowdown because of modeling or more.

Thank you for your reply.
I understand that the GPEI strategy requires fitting a surrogate model, so it will be slower than the Sobol strategy, but recently it has been unusually slow. Generating one parameter combination with GPEI used to take only about 10 seconds, but recently it can take several minutes. Has anyone else run into the same problem? Thank you.

@sczhang870330, can you provide a minimal repro of your code, or at least some more details about the optimization problem you're working on?

I ran into the same problem: a simple example runs slower on a faster CPU. My benchmark code:

import numpy as np
from ax import (
    ComparisonOp,
    ParameterType, 
    RangeParameter,
    SearchSpace, 
    SimpleExperiment, 
    OutcomeConstraint, 
)
from ax.metrics.l2norm import L2NormMetric
from ax.modelbridge.registry import Models
import time

def evaluation_function(parameterization,weight=None):
    x = np.array([parameterization.get(f"x{i}") for i in range(1)])
    print("evaluation_function: ", x)
    return {"evaluation_function": (x**2, 0.0), "l2norm": (np.sqrt((x ** 2).sum()), 0.0)}

search_space = SearchSpace(
    parameters=[
        RangeParameter(
            name=f"x{i}", parameter_type=ParameterType.FLOAT, lower=0.0, upper=1.0
        )
        for i in range(1)
    ]
)

exp = SimpleExperiment(
    name="test",
    search_space=search_space,
    evaluation_function=evaluation_function,
    objective_name="evaluation_function",
    minimize=True,
    outcome_constraints=[
        OutcomeConstraint(
            metric=L2NormMetric(
                name="l2norm", param_names=[f"x{i}" for i in range(1)], noise_sd=0.2
            ),
            op=ComparisonOp.LEQ,
            bound=1.25,
            relative=False,
        )
    ],
)

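# Seed the experiment with 5 quasi-random Sobol trials.
sobol = Models.SOBOL(exp.search_space)
for i in range(5):
    exp.new_trial(generator_run=sobol.gen(1))
# Evaluate all trials, then fit the GP surrogate on the resulting data.
d = exp.eval()
gpei = Models.BOTORCH(experiment=exp, data=d)
print("Fit time:", gpei.fit_time)
# Time the generation of a single candidate.
t = time.time()
run = gpei.gen(1)
print("Gen time:", time.time() - t)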

Results
My laptop:

4 core Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz:
Fit time: 0.12
Gen time: 0.30

Cloud-CPU:

0.5 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Fit time: 6.72
Gen time: 55.05
3 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Fit time: 25.64
Gen time: 49.90
6 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Fit time: 12.38
Gen time: 21.08

I can't figure out the cause of this difference. Maybe it has something to do with the cloud environment, but a 100x difference is very strange. A simple (single-threaded) CPU benchmark doesn't show such a difference:

4 core Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz:
Average: 3.578s
0.5 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Average: 8.854s
3 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz:
Average: 2.997s

Any suggestions on how to debug it?
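One possible starting point, offered only as a minimal sketch: time the slow call with both a wall-clock timer and a CPU-time timer. The helper `timed` and the placeholder workload `busy_work` are hypothetical names introduced here, not part of Ax; in the benchmark above the timed callable would be something like `lambda: gpei.gen(1)`.

```python
import time


def timed(fn, *args, repeats=3, **kwargs):
    """Run fn several times and record wall-clock and CPU seconds per run."""
    runs = []
    for _ in range(repeats):
        wall0 = time.perf_counter()
        cpu0 = time.process_time()  # user + system CPU time of this process
        fn(*args, **kwargs)
        runs.append(
            {
                "wall_s": time.perf_counter() - wall0,
                "cpu_s": time.process_time() - cpu0,
            }
        )
    return runs


def busy_work(n=200_000):
    # Placeholder workload (hypothetical); swap in the call under test.
    return sum(i * i for i in range(n))


for run in timed(busy_work):
    print(f"wall={run['wall_s']:.3f}s cpu={run['cpu_s']:.3f}s")
```

If CPU time is far larger than wall time, many threads were busy at once. If wall time is far larger than CPU time, the process spent most of the run waiting, which on a shared or throttled machine may point at the environment rather than the code.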

cc @Balandat, were there any underlying changes in PyTorch recently that might have caused this? From the OP, it sounds like the slowdown only started recently.

Not that I am aware of. Our implementations heavily exploit vectorized computations via Intel's MKL, so if that is not available/installed correctly then you'll see significant slowdowns. @valentin7121 can you provide the MKL build setup of both your laptop and the cloud setup?

@Balandat I did not install MKL manually on either the laptop or the cloud machine. I only installed ax-platform with its dependencies:
pip3 install ax-platform
How can I check the MKL setup?
(Laptop OS: Ubuntu 20.04.2 LTS; cloud OS: Ubuntu 18.04.3 LTS)

On Ubuntu 18.04.3 LTS, I got the following error during installation:

ERROR: torchvision 0.4.1+cu100 has requirement torch==1.3.0, but you'll have torch 1.7.1 which is incompatible.
Installing collected packages: torch, gpytorch, botorch, retrying, plotly, typeguard, ax-platform
  Attempting uninstall: torch
    Found existing installation: torch 1.3.0+cu100
    Uninstalling torch-1.3.0+cu100:
      Successfully uninstalled torch-1.3.0+cu100
Successfully installed ax-platform-0.1.19 botorch-0.3.3 gpytorch-1.3.1 plotly-4.14.3 retrying-1.3.3 torch-1.7.1 typeguard-2.10.0

Could this be the reason for the slowdown?

cc @Balandat

> Could this be the reason for the slowdown?

I doubt it - this just installs a much newer PyTorch version, so that should be OK.

Can you paste the output of the following?

import torch
print(torch.__config__.show())

The PyTorch config is the same on both the laptop and the cloud machine:

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
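Since the build configuration is identical on both machines, a further check could compare runtime settings such as the thread counts PyTorch actually uses. The following is a sketch under that assumption. `torch_runtime_info` is a hypothetical helper name, and it falls back gracefully if torch is not importable.

```python
def torch_runtime_info():
    """Collect a few PyTorch runtime settings for side-by-side comparison."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # torch is not installed in this environment
    return {
        "torch": torch.__version__,
        "mkl_available": torch.backends.mkl.is_available(),
        "intraop_threads": torch.get_num_threads(),
        "interop_threads": torch.get_num_interop_threads(),
    }


print(torch_runtime_info())
```

Running this on both machines and diffing the output would show whether the thread settings differ even though the build does not.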

After rebuilding the OS image on the cloud machine, the problem seems to be solved, even though the PyTorch configuration hasn't changed. The 3-core Cloud-CPU now shows the following in the same benchmark:

Fit time: 0.09515857696533203
Gen time: 0.5103681087493896

Thank you!

Interesting. Might have been a one-off issue with the cloud machine. Let us know if this recurs.
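If the slowdown does come back, it may help to attach a small snapshot of the CPU environment to the report. This is a sketch using only the standard library. `cpu_snapshot` is a hypothetical helper, and the choice of `OMP_NUM_THREADS` and `MKL_NUM_THREADS` as the relevant variables is an assumption.

```python
import json
import os
import platform


def cpu_snapshot():
    """Record CPU counts and thread-related environment variables."""
    info = {
        "platform": platform.platform(),
        "logical_cpus": os.cpu_count(),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "MKL_NUM_THREADS": os.environ.get("MKL_NUM_THREADS"),
    }
    # sched_getaffinity exists on Linux only; it lists the CPUs this process may use.
    if hasattr(os, "sched_getaffinity"):
        info["usable_cpus"] = len(os.sched_getaffinity(0))
    return info


print(json.dumps(cpu_snapshot(), indent=2))
```

On a fractional-core cloud instance, the reported logical CPU count may be larger than the share of CPU time the instance is actually allowed, so numerical libraries may start more threads than the quota can serve.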
