Vision: About RegisterOperators in C++

Created on 23 Apr 2020  ·  23 comments  ·  Source: pytorch/vision

๐Ÿ› Bug

I'm testing JIT operator registration in C++ (libtorch).

My test case:

#include <iostream>

int main(int argc, const char* argv[]) {
    auto& ops = torch::jit::getAllOperators();
    std::cout << "torch jit operators\n";
    for (auto& op: ops) {
        auto& name = op->schema().name();
        if (name.find("torchvision") != std::string::npos)
            std::cout << "op : " << op->schema().name() << "\n";
    }
    std::cout << "\n";
    return 0;
}

>>> torch jit operators

Expected output (after I changed some code in vision):

>>> torch jit operators
op : torchvision::_cuda_version
op : torchvision::deform_conv2d
op : torchvision::ps_roi_pool
op : torchvision::ps_roi_align
op : torchvision::_new_empty_tensor_op
op : torchvision::roi_pool
op : torchvision::roi_align
op : torchvision::nms

In the first case, it seems no JIT operators are registered with libtorch.

Does this only occur in my environment, or is something wrong in the code?

I ask because I do get the expected output once I change the sources.

Environment

Installed with the commands from the README.

torch 1.5.0 and torchvision 0.6.0

Labels: help wanted, c++ frontend, ops, torchscript

All 23 comments

This issue is better discussed on the forum. I will close this issue, but please feel free to re-open with more information and details about what was changed so that we can reproduce.

Moreover, if you encounter an issue with jit, I encourage you to open an issue on pytorch instead.

@vincentqb
This is not a PyTorch problem. It's about how the registration API is called when using C++ torchvision.

The problem is that the registration code is not run when a custom project links against torchvision.

When I moved the registration code into a header, the project ran the registration at startup.

Could you check the JIT operators in C++?

If I understand correctly: you have one version that works, and another that doesn't? Can you provide a complete minimal example to run with how you compile so we can reproduce?

cc @fmassa

@vincentqb
Basically I just installed with CMake and ran hello_world. I'm now trying 0.6, but 0.5 has the same problem.

  1. I installed following README.rst:
cmake -DCMAKE_PREFIX_PATH="/home/user/torch_build/libtorch150" -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CUDA=on ..
make -j16
make install
  2. collect_env.py
user(t3) in ~/torch_build via env150
➜ python3 collect_env.py
Collecting environment information...
PyTorch version: 1.5.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
CMake version: version 3.17.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 418.87.01
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip3] numpy==1.18.3
[pip3] torch==1.5.0+cu101
[pip3] torchvision==0.6.0+cu101
[conda] Could not collect
  3. main.cpp (just a modified examples/cpp/hello_world/main.cpp):
#include <iostream>

#include <torchvision/vision.h>

int main()
{
  auto& ops = torch::jit::getAllOperators();
  std::cout << "torch jit operators\n";
  for (auto& op: ops) {
      auto& name = op->schema().name();
      if (name.find("torchvision") != std::string::npos)
          std::cout << "op : " << op->schema().name() << "\n";
  }
  std::cout << "\n";
  return 0;
}

Then I run the main function above.

I expect torch::jit::getAllOperators() in main to find the operators related to torchvision, but in my environment it prints nothing.

So I moved the torch::RegisterOperators() calls into a header (.h) instead of a .cpp file. That way the compiler sees the registration code directly in my translation unit, and the torchvision operators show up as expected.

In my experience, I hit the same problem with my own custom kernel when using the JIT. Once I made the registration run directly, everything worked and I could find my kernel in getAllOperators().

Thanks for the report @zsef123 !

ccing @suo , as I believe this is currently an issue with the torchscript RegisterOperators (or our use of it), which makes the operators not readily available for the users, who need instead to re-register the ops again in user code.

@fmassa
So the vision lib intends for each user to manually register the ops in their project, rather than having them registered automatically by including the header?

Is that right? I think I misunderstood that design.

That's currently what's needed, but I don't think this is a desirable setup.
Maybe we just need to move the RegisterOperator calls from the C++ file into the headers in torchvision, so that they get registered automatically once the user includes it.

Let's keep this issue open until we figure out a better way of handling this

I think what's happening is that the linker is throwing out the static registration of the operators when linking in the torchvision lib, since the RegisterOperator object is never referenced from the main executable. Can you try linking in torchvision with -Wl,--whole-archive?

@suo Thanks to comments!!

I tried the following:

CMakeLists.txt

cmake_minimum_required(VERSION 3.10)
project(hello-world)

find_package(TorchVision REQUIRED)

add_executable(hello-world main.cpp)
target_link_libraries(hello-world
    -Wl,--whole-archive
    TorchVision::TorchVision
)

But these errors occur:

[ 50%] Linking CXX executable hello-world
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_muldi3.o): In function `__multi3':
(.text+0x0): multiple definition of `__multi3'
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_muldi3.o):(.text+0x0): first defined here
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_negdi2.o): In function `__negti2':
(.text+0x0): multiple definition of `__negti2'

...

Is this a problem with my setup?

user(t3) in vision/examples/cpp/hello_world/build on master [!?] via env150 took 2s
➜ cmake --version
cmake version 3.17.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

user(t3) in vision/examples/cpp/hello_world/build on master [!?] via env150
➜ gcc --version
gcc (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

whole-archive is not the right flag. The real problem is that your executable isn't using any symbols from torchvision, so the linker is pruning it away. You can verify this by running ldd on the resulting executable; you will see a dependency on libtorch.so but not on torchvision. If you specify

target_link_libraries(hello-world
    -Wl,--no-as-needed
    TorchVision::TorchVision
)

this will force the linker to include it.

@fmassa we should consider updating the TorchVision cmake target to automatically specify --no-as-needed, this will help prevent this common failure mode.

@bmanga could you look into adding --no-as-needed in the torchvision CMakeLists?

@fmassa May I suggest using an inline variable or a function to register the operators instead?
I can give that a try first, so we can avoid hacks in the cmake scripts.

@ezyang what do you think?

I think it might be preferable to avoid having to ask users to register_torchvision_ops() in their C++ files.

@fmassa Inline variables should work just like the current approach. Officially they are C++17, but we can get essentially identical functionality using __attribute__((weak)).

I personally don't mind a RegisterOps function. I think there might be users of torchvision that don't need them, and for those who do it's just a function call in their main.

Either approach is IMO better than using linker hacks, otherwise we end up with a mess like pytorch's cmake scripts. As a bonus, we can even remove the hardcoded requirement that the lib must be built as a shared library.

EDIT: I think __attribute__((constructor)) is exactly what we need here. I will try to create a PR tomorrow.

@bmanga I don't think __attribute__((constructor)) will work? It will put things in the ctors section of your library, which is the same thing that static initializers do, and the linker will prune it away if you literally don't depend on any symbols.

There's nothing wrong with having a register operators function. It doesn't do any harm, and it will save you from having to do --no-as-needed. But, for better or worse, PyTorch as a library relies a LOT on static initializers, and I think it's better to fall in line with the prevailing convention as a downstream library (torchvision). So I am advocating that for the default behavior, we should pass in the right linker flags. It is not too much cmake.

cc'ing @malfet too

@ezyang true, late night thoughts should be left for the morning 😅

I think it is a bit unreasonable to expect the operators to be registered just by linking the library without even including a single header. Is that really a common case?

What about the following?

  • Have the user include any header from the library, even just vision.h
  • Document in the README that if you are only planning to use torchvision ops, you must either include something or add
    set_target_properties(exe PROPERTIES LINK_WHAT_YOU_USE TRUE) to the torchvision consumer target.

If you think that is still not acceptable, I will copy the code from caffe2_interface_library macro.

Hi @bmanga

Thanks for the discussion. After seeing the discussion here, I think it might be preferable to follow PyTorch's convention.

@bmanga can you clarify why we would need to copy the code from caffe2_interface_library, instead of just passing --no-as-needed?

@fmassa because if we really want this behavior:

  • it should be done for windows, mac and linux
  • We want to limit it to torchvision only, so we need to add "-Wl,--no-as-needed,libtorchvision.so,-Wl,--as-needed"
  • If we can, it would be good not to tie ourselves into requiring that torchvision be built as a shared library

caffe2_interface_library mostly solves these problems, although it is not completely correct either because it assumes that the user target has --no-as-needed enabled by default, which may not be the case and be a source of other issues.

I agree that writing the relevant cmake code portably is not so easy. I am a bit curious how the #include trick would work; since I feel like you still have to use the symbols somehow, just externing this isn't enough. (Maybe if the header sets some weak symbol)?

@ezyang yes, that's what I had in mind when I mentioned inline variables.
In the vision.h header, we could have something like:

#if (defined __cpp_inline_variables) || __cplusplus >= 201703L
#define VISION_INLINE_VARIABLE inline
#else
#ifdef _MSC_VER
#define VISION_INLINE_VARIABLE __declspec(selectany)
#else
#define VISION_INLINE_VARIABLE __attribute__((weak))
#endif
#endif

namespace vision {
namespace impl {
int RegisterOps() noexcept;
VISION_INLINE_VARIABLE int dummy = RegisterOps();
}
}

And that should be enough to ensure that the library is loaded (RegisterOps is defined in a cpp file), and that the code is run only once.

It feels a bit dirty doing this, but I'm willing to give it a try.

