Tvm: apply_history_best requires extract_from_program to be run even if log file already exists

Created on 20 Jul 2020  路  14Comments  路  Source: apache/tvm

I'm running into a weird bug where apply_history_best doesn't work unless extract_from_program has been called somewhere before in the program (given the log file already exists).

Here is a minimal example:

import os
import sys

import numpy as np
import tvm
from tvm import te
from tvm import autotvm
from tvm import relay
import tvm.relay.testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.contrib.util import tempdir
import tvm.contrib.graph_runtime as runtime

input_shape = (1, 3, 224, 224)
output_shape = (1, 1000)

def run(mod, params, ctx):
    lib = relay.build(mod, target="llvm", params=params)

    dummy_data = np.random.uniform(size=input_shape).astype("float32")

    m = runtime.GraphModule(lib['default'](ctx))
    m.set_input('data', dummy_data)
    m.run()
    tvm_output = m.get_output(0)

    ftimer = m.module.time_evaluator("run", ctx, repeat=5, number=5)
    prof_res = np.array(ftimer().results) * 1000
    print(
        "%-20s %-19s (%s)"
        % ("Runtime:", "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res))
    )


if __name__ == "__main__":
    mod, params = tvm.relay.testing.resnet.get_workload()

    ctx = tvm.cpu()
    print("Untuned")
    run(mod, params, ctx)

    log_filename = "bug_tuning.log"
    tmp_log_file = log_filename + ".tmp"
    if os.path.exists(tmp_log_file):
        os.remove(tmp_log_file)

    if sys.argv[1] == "tune" or sys.argv[1] == "extract":
        tasks = autotvm.task.extract_from_program(mod["main"], target="llvm",
                                                  params=params,
                                                  )

        if sys.argv[1] == "tune":
            for i, task in enumerate(tasks[0:4]):
                prefix = "[Task %2d/%2d] " % (i+1, len(tasks))
                tuner_obj = GridSearchTuner(task)

                # do tuning
                measure_option = autotvm.measure_option(
                        builder=autotvm.LocalBuilder(),
                        runner=autotvm.LocalRunner(number=10, repeat=1,
                                                   min_repeat_ms=1000),
                    )
                n_trial= min(len(task.config_space), 10)
                tuner_obj.tune(n_trial=n_trial,
                               early_stopping=False,
                               measure_option=measure_option,
                               callbacks=[
                                   autotvm.callback.progress_bar(n_trial, prefix=prefix),
                                   autotvm.callback.log_to_file(tmp_log_file)])

            # pick best records to a cache file
            autotvm.record.pick_best(tmp_log_file, log_filename)
            os.remove(tmp_log_file)

    print("Tuned")
    with autotvm.apply_history_best(log_filename):
        run(mod, params, ctx)

Run it like so:

python3 example.py tune
python3 example.py extract
python3 example.py run

On my laptop (macOS, quad-core 2.3GHz i7) I get the following output:

$ python3 simple_bug.py tune
Untuned
Cannot find config for target=llvm -keys=cpu, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Runtime:             41.48 ms            (2.09 ms)
[Task  1/13]  Current/Best:   17.93/  34.07 GFLOPS | Progress: (10/10) | 27.28 s Done.
[Task  2/13]  Current/Best:    7.23/   7.24 GFLOPS | Progress: (10/10) | 18.49 s Done.
[Task  3/13]  Current/Best:   14.49/  14.49 GFLOPS | Progress: (10/10) | 22.80 s Done.
[Task  4/13]  Current/Best:   14.46/  14.46 GFLOPS | Progress: (10/10) | 18.14 s Done.
Tuned
Runtime:             178.57 ms           (1.56 ms)

$ python3 simple_bug.py extract
Untuned
Cannot find config for target=llvm -keys=cpu, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Runtime:             40.87 ms            (0.60 ms)
Tuned
Runtime:             178.05 ms           (1.02 ms)

$ python3 simple_bug.py run
Untuned
Cannot find config for target=llvm -keys=cpu, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Runtime:             40.37 ms            (0.75 ms)
Tuned
Runtime:             39.85 ms            (0.69 ms)

Ignore the fact that tuning makes results worse. I only run a couple iterations of the tuner to make the program run quickly.

On both tune and extract, the tuned performance is different from the untuned performance indicating that a tuned schedule was used. But on run, performance is the same between tuned and untuned. The only difference between extract and run is the call to autotvm.task.extract_from_program. I'm not clear what exactly the cause of this performance discrepancy is.

All 14 comments

Could you add compile_engine.get().clear() (require from tvm.relay.backend import compile_engine)before with autotvm.apply_history_best(log_filename) to see if it could solve this issue?

Adding compile_engine.get().clear() does not fix the issue.

I did a few tests with the script you provided and I'm fairly certain this has nothing to do with apply_history_best. I think this is instead an issue of whether tophub schedules are applied or not. It seems like calling task extraction registers tophub schedules that can override the manually tuned schedules, which in your case seems to be causing the worse performance. A good way to test this would be using a workload that isnt in tophub, maybe just by changing the input shapes to resnet.

I ran the code and found adding compile_engine.get().clear() before with autotvm.apply_history_best(log_filename): acutally fixes the issue.
This issue is not related to autotvm at all. The autotvm feature extraction is not stateful.
Instead, the relay compiler is stateful. We have to clear the relay cache before using a different schedule for the same relay subgraph.
@tkonolige @jwfromm Could you double-check?

To prevent such issues from happening again, we can force relay to clear the cache after every relay.build.
So the cache mechanism in relay only works within a single module.

Now it is working with compile_engine.get().clear(), but I had to remove the tuning log first.

What do you mean by "remove the tuning log"?
If you add compile_engine.get().clear(), then the result of python3 example.py extract will match the result of python3 example.py run exactly.

In python3 example.py extract, the autotvm task extraction calls compile_engine.get().clear() internally.
In python3 example.py run, in order to match the above command, we can manully call compile_engine.get().clear()

@merrymercy, it seems like the keys are still being saved in the DispatchContext memory even after clearing the cache. An easy to way to see this is doing build twice on the same graph with no logs and clearing the cache in between. We see the usual warning message Cannot find config for ... for the first builds operations but not the second. Should we be doing a full clear?

@merrymercy When I tested compile_engine.get().clear() the first time I might have had a corrupted or incorrect tuning log. The second time I test compile_engine.get().clear() I removed the tuning log before hand. It is unclear this had any effect.

@jroesch Yes, the behavior you mentioned is desired. That is designed by me to mimic the behavior of logging.warning in python's built-in library. The principle is that a warning message should be printed only once.

However, this behavior does not affect performance. Because if you see such a warning, it means the returned config is an empty FallbackConfigEntity. This config will be cached in FallbackContext. Caching a empty config does not affect anything.
If next time you have a tuned config, it has a higher priority to be selected than the empty config in FallbackContext.
If next time you still do not have a tuned config, then returning a cached empty config is the same as returning an uncached empty config.

ok that all makes sense to me, thanks for the clarification @merrymercy!

We should probably try to upstream a PR that adds compile_engine.get().clear() to the end of the relay.build function.

@tkonolige @jwfromm yeah the caching makes sense to me, this is why I was against it a long time ago 馃槅 . I think we should add to the C++ implementation of build to ensure that we don't accidentally trigger this behavior in other places. The main purpose of the cache is not code generate identical ops inside of the same graph from my point-of-view.

cc @tqchen

To conclude, the current cache mechanism in relay works globally across different calls to relay.build. If we want to use two different schedules for the same relay subgraph in two different relay.build, we have to call compile_engine.get().clear() between two relay.build.

In @tkonolige 's test, the autotvm task extraction calls compile_engine.get().clear() internllay, making @tkonolige draw the conclusion that task extraction is required. But actually compile_engine.get().clear() is what we need.

To fix the issue, we all agree that we should make the relay cache work only within a single module.
@tkonolige Could you send the fix?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

KayneWest picture KayneWest  路  6Comments

jroesch picture jroesch  路  5Comments

derisavi picture derisavi  路  6Comments

zhiqwang picture zhiqwang  路  4Comments

MarisaKirisame picture MarisaKirisame  路  5Comments