Vision: About RegisterOperators in C++

Created on 23 Apr 2020  ·  23 comments  ·  Source: pytorch/vision

๐Ÿ› Bug

I'm testing JIT operator registration in C++ (libtorch).

My test case:

#include <iostream>

int main(int argc, const char* argv[]) {
    auto& ops = torch::jit::getAllOperators();
    std::cout << "torch jit operators\n";
    for (auto& op: ops) {
        auto& name = op->schema().name();
        if (name.find("torchvision") != std::string::npos)
            std::cout << "op : " << op->schema().name() << "\n";
    }
    std::cout << "\n";
    return 0;
}

>>> torch jit operators

Expected output (after I changed some code in vision):

>>> torch jit operators
op : torchvision::_cuda_version
op : torchvision::deform_conv2d
op : torchvision::ps_roi_pool
op : torchvision::ps_roi_align
op : torchvision::_new_empty_tensor_op
op : torchvision::roi_pool
op : torchvision::roi_align
op : torchvision::nms

In the first case, it seems no JIT operators are registered with libtorch.

Does this only occur in my environment, or is something wrong in the code?

I ask because I do get the expected output once I change the sources.

Environment

Installed with the commands from the README.

torch 1.5.0 and torchvision 0.6.0

Labels: help wanted, c++ frontend, ops, torchscript

All 23 comments

This issue is better discussed on the forum. I will close this issue, but please feel free to re-open with more information and details about what was changed so that we can reproduce.

Moreover, if you encounter an issue with jit, I encourage you to open an issue on pytorch instead.

@vincentqb
This is not a PyTorch problem. It's about how the registration API is called when using C++ torchvision.

The problem is that the registration code is not run when a custom project links against torchvision.

When I moved the registration code into a header, the project ran the registration at startup.

Could you check the JIT operators in C++?

If I understand correctly: you have one version that works, and another that doesn't? Can you provide a complete minimal example to run with how you compile so we can reproduce?

cc @fmassa

@vincentqb
Basically I just installed with CMake and ran hello_world. I'm now trying 0.6, but 0.5 has the same problem.

  1. I installed following README.rst:
cmake -DCMAKE_PREFIX_PATH="/home/user/torch_build/libtorch150" -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CUDA=on ..
make -j16
make install
  2. collect_env.py
user(t3) in ~/torch_build via env150
➜ python3 collect_env.py
Collecting environment information...
PyTorch version: 1.5.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
CMake version: version 3.17.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 418.87.01
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip3] numpy==1.18.3
[pip3] torch==1.5.0+cu101
[pip3] torchvision==0.6.0+cu101
[conda] Could not collect
  3. main.cpp (just a modified examples/cpp/hello_world/main.cpp):
#include <iostream>

#include <torchvision/vision.h>

int main()
{
  auto& ops = torch::jit::getAllOperators();
  std::cout << "torch jit operators\n";
  for (auto& op: ops) {
      auto& name = op->schema().name();
      if (name.find("torchvision") != std::string::npos)
          std::cout << "op : " << op->schema().name() << "\n";
  }
  std::cout << "\n";
  return 0;
}

Then I run the main function above.

I expect torch::jit::getAllOperators() in main to find the operators related to torchvision, but in my environment it prints nothing.

So I moved the torch::RegisterOperators() calls into a header (.h) instead of a .cpp file. That way the compiler sees the registration code directly in my translation unit, and the torchvision operators show up as expected.

In my experience, I hit the same problem with my own custom kernel when using the JIT. Once I made the registration run directly, everything worked and I could find my kernel in getAllOperators().

Thanks for the report @zsef123 !

ccing @suo , as I believe this is currently an issue with the torchscript RegisterOperators (or our use of it), which makes the operators not readily available for the users, who need instead to re-register the ops again in user code.

@fmassa
So the vision lib intends for each user to manually register the ops in their project, rather than having them registered automatically by including the header?

Is that right? I think I misunderstood that design.

That's currently what's needed, but I don't think this is a desirable setup.
Maybe we just need to move the RegisterOperator calls from the C++ file into the headers in torchvision, so that they get registered automatically once the user includes it.

Let's keep this issue open until we figure out a better way of handling this

I think what's happening is that the linker is throwing out the static registration of the operators when linking in the torchvision lib, since the RegisterOperator object is never referenced from the main executable. Can you try linking in torchvision with -Wl,--whole-archive?

@suo Thanks to comments!!

I tried the following:

CMakeLists.txt

cmake_minimum_required(VERSION 3.10)
project(hello-world)

find_package(TorchVision REQUIRED)

add_executable(hello-world main.cpp)
target_link_libraries(hello-world
    -Wl,--whole-archive
    TorchVision::TorchVision
)

But these errors occur:

[ 50%] Linking CXX executable hello-world
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_muldi3.o): In function `__multi3':
(.text+0x0): multiple definition of `__multi3'
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_muldi3.o):(.text+0x0): first defined here
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_negdi2.o): In function `__negti2':
(.text+0x0): multiple definition of `__negti2'

...

Is this a problem with my setup?

user(t3) in vision/examples/cpp/hello_world/build on master [!?] via env150 took 2s
➜ cmake --version
cmake version 3.17.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

user(t3) in vision/examples/cpp/hello_world/build on master [!?] via env150
➜ gcc --version
gcc (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

whole-archive is not the right flag. The real problem is that your executable isn't using any symbols from torchvision, so the linker is pruning it away. You can verify this by running ldd on the resulting executable; you will see a dependency on libtorch.so but not on torchvision. If you specify

target_link_libraries(hello-world
    -Wl,--no-as-needed
    TorchVision::TorchVision
)

this will force the linker to include it.

@fmassa we should consider updating the TorchVision cmake target to automatically specify --no-as-needed, this will help prevent this common failure mode.

@bmanga could you look into adding --no-as-needed in the torchvision CMakeLists?

@fmassa May I suggest using an inline variable or a function to register the operators instead?
I can give that a try first, so we can avoid hacks in the cmake scripts.

@ezyang what do you think?

I think it might be preferable to avoid having to ask users to register_torchvision_ops() in their C++ files.

@fmassa Inline variables should work just like the current approach. Officially they are C++17, but we can get essentially identical functionality using __attribute__((weak)).

I personally don't mind a RegisterOps function. I think there might be users of torchvision that don't need them, and for those who do it's just a function call in their main.

Either approach is IMO better than using linker hacks, otherwise we end up with a mess like pytorch's cmake scripts. As a bonus, we can even remove the hardcoded requirement that the lib must be built as a shared library.

EDIT: I think __attribute__((constructor)) is exactly what we need here. I will try to create a PR tomorrow.

@bmanga I don't think __attribute__((constructor)) will work? It will put things in the ctors section of your library, which is the same thing that static initializers do, and the linker will prune it away if you literally don't depend on any symbols.

There's nothing wrong with having a register operators function. It doesn't do any harm, and it will save you from having to do --no-as-needed. But, for better or worse, PyTorch as a library relies a LOT on static initializers, and I think it's better to fall in line with the prevailing convention as a downstream library (torchvision). So I am advocating that for the default behavior, we should pass in the right linker flags. It is not too much cmake.

cc'ing @malfet too

@ezyang true, late night thoughts should be left for the morning 😅

I think it is a bit unreasonable to expect the operators to be registered just by linking the library without even including a single header. Is that really a common case?

What about the following?

  • Have the user include any header from the library, even just vision.h
  • Document in the README that if you are only planning to use torchvision ops, you must either include something or add
    set_target_properties(exe PROPERTIES LINK_WHAT_YOU_USE TRUE) to the torchvision consumer target.

If you think that is still not acceptable, I will copy the code from caffe2_interface_library macro.

Hi @bmanga

Thanks for the discussion. After seeing the discussion here, I think it might be preferable to follow PyTorch's convention.

@bmanga can you clarify why we would need to copy the code from caffe2_interface_library, instead of just passing --no-as-needed?

@fmassa because if we really want this behavior:

  • it should be done for windows, mac and linux
  • We want to limit it to torchvision only, so we need to add "-Wl,--no-as-needed,libtorchvision.so,-Wl,--as-needed"
  • If we can, it would be good not to tie ourselves into requiring that torchvision be built as a shared library

caffe2_interface_library mostly solves these problems, although it is not completely correct either because it assumes that the user target has --no-as-needed enabled by default, which may not be the case and be a source of other issues.

I agree that writing the relevant cmake code portably is not so easy. I am a bit curious how the #include trick would work; since I feel like you still have to use the symbols somehow, just externing this isn't enough. (Maybe if the header sets some weak symbol)?

@ezyang yes, that's what I had in mind when I mentioned inline variables.
In the vision.h header, we could have something like:

#if (defined __cpp_inline_variables) || __cplusplus >= 201703L
#define VISION_INLINE_VARIABLE inline
#else
#ifdef _MSC_VER
#define VISION_INLINE_VARIABLE __declspec(selectany)
#else
#define VISION_INLINE_VARIABLE __attribute__((weak))
#endif
#endif

namespace vision {
namespace impl {
int RegisterOps() noexcept;
VISION_INLINE_VARIABLE int dummy = RegisterOps();
}
}

And that should be enough to ensure that the library is loaded (RegisterOps is defined in a cpp file), and that the code is run only once.

It feels a bit dirty doing this, but I'm willing to give it a try.

