Glow: Accessing names of symbols from SymbolTable* inside BundleConfig causes SIGSEGV

Created on 31 May 2019 · 15Comments · Source: pytorch/glow

Running on Ubuntu 18.04 with:

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/examples/bundles/x$ clang --version
clang version 7.0.0-3~ubuntu0.18.04.1 (tags/RELEASE_700/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

There are two issues here. First, the names in the SymbolTable inside BundleConfig do not seem to be consistent with those given in the ONNX graph. For instance, consider this graph:

graph(%in : Float(1, 2)
      %1 : Float(8, 2)
      %2 : Float(8)
      %3 : Float(8, 8)
      %4 : Float(8)
      %5 : Float(8, 8)
      %6 : Float(8)
      %7 : Float(1, 8)
      %8 : Float(1)) {
  %9 : Float(1, 8) = onnx::Gemm[alpha=1, beta=1, transB=1](%in, %1, %2), scope: FeedForwardNN/Linear
  %10 : Float(1, 8) = onnx::Relu(%9), scope: FeedForwardNN/ReLU
  %11 : Float(1, 8) = onnx::Gemm[alpha=1, beta=1, transB=1](%10, %3, %4), scope: FeedForwardNN/ReLU
  %12 : Float(1, 8) = onnx::Relu(%11), scope: FeedForwardNN/ReLU
  %13 : Float(1, 8) = onnx::Gemm[alpha=1, beta=1, transB=1](%12, %5, %6), scope: FeedForwardNN/ReLU
  %14 : Float(1, 8) = onnx::Relu(%13), scope: FeedForwardNN/ReLU
  %15 : Float(1, 1) = onnx::Gemm[alpha=1, beta=1, transB=1](%14, %7, %8), scope: FeedForwardNN/ReLU
  %out : Float(1, 1) = onnx::Sigmoid(%15), scope: FeedForwardNN/Sigmoid
  return (%out);
}

generated by

torch.onnx.export(ffnn, inp, "ffnn.onnx", input_names="in", 
                  output_names="out", verbose=True, export_params=True)

Now viewing the symbols inside the generated bundle (with relocation-model=pic):

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -p 5 ./output/x.o
String dump of section '.rodata':
  [     0]  save_out
  [     9]  in

In other words, Glow seems to prepend the output tensor name with save_. I don't think this is documented anywhere.

Now the real issue. Consider the following innocuous code that simply returns the (bounded) string length of the names from the symbol table:

#include <stdio.h>
#include <stdint.h>

struct SymbolTableEntry {
    const char *name;
    uint64_t offset;
    uint64_t size;
    char kind;
};

struct BundleConfig {
    uint64_t cwv_size;
    uint64_t mwv_size;
    uint64_t a_size;
    uint64_t alignment;
    uint64_t nsymbols;
    const struct SymbolTableEntry *symbols;
};

extern struct BundleConfig x_config;

int main(int argc, char **argv)
{
    printf("Total number of symbols: %d\n", x_config.nsymbols);

    for (int jj = 0; jj < x_config.nsymbols; ++jj) {
        if (x_config.symbols[jj].name != NULL)
            printf("strnlen(sym[%d].name, 8) == %d\n", jj, strnlen(x_config.symbols[jj].name, 8));
    }

    return 0;
}

Compiling and running the code gives:

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/examples/bundles/x$ ./ttt
Total number of symbols: 8
strnlen(sym[0].name, 8) == 8
strnlen(sym[1].name, 8) == 2
Segmentation fault (core dumped)

Let's look at those symbols inside x.o again:

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -s ./output/x.o

Symbol table '.symtab' contains 19 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS llvm-link
     2: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT    4 .LCPI7_0
     3: 0000000000000004     0 NOTYPE  LOCAL  DEFAULT    4 .LCPI7_1
     4: 00000000000000a0   377 FUNC    LOCAL  DEFAULT    2 libjit_matmul_f_0_special
     5: 00000000000002c0   272 FUNC    LOCAL  DEFAULT    2 libjit_matmul_f_2_special
     6: 0000000000000510   195 FUNC    LOCAL  DEFAULT    2 libjit_matmul_f_5_special
     7: 00000000000003d0   159 FUNC    LOCAL  DEFAULT    2 libjit_stacked_kernel.1_3
     8: 0000000000000470   159 FUNC    LOCAL  DEFAULT    2 libjit_stacked_kernel.2_4
     9: 00000000000005e0    62 FUNC    LOCAL  DEFAULT    2 libjit_stacked_kernel.3_6
    10: 0000000000000220   159 FUNC    LOCAL  DEFAULT    2 libjit_stacked_kernel_1_s
    11: 0000000000000000    64 OBJECT  LOCAL  DEFAULT    6 xSymbolTable
    12: 0000000000000000     0 SECTION LOCAL  DEFAULT    5 
    13: 0000000000000000     0 SECTION LOCAL  DEFAULT    6 
    14: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND expf
    15: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND free
    16: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND posix_memalign
    17: 0000000000000000   158 FUNC    GLOBAL DEFAULT    2 x
    18: 0000000000000040    48 OBJECT  GLOBAL DEFAULT    6 x_config

Ok, now dump .rodata:

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -x .rodata ./output/x.o

Hex dump of section '.rodata':
  0x00000000 73617665 5f6f7574 00696e00          save_out.in.

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$

We only see save_out and in. Where are the other names? Compiling with -relocation-model=static gives:

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -x .rodata ./output/x.o

Hex dump of section '.rodata':
 NOTE: This section has relocations against it, but these have NOT been applied to this dump.
  0x00000000 73617665 5f6f7574 00696e00 00000000 save_out.in.....
  0x00000010 00000000 00000000 00000000 00000000 ................
  0x00000020 01000000 00000000 01000000 00000000 ................
  0x00000030 00000000 00000000 40000000 00000000 ........@.......
  0x00000040 02000000 00000000 01000000 00000000 ................
  0x00000050 80030000 00000000 80000000 00000000 ................
  0x00000060 80000000 00000000 40000000 00000000 ........@.......
  0x00000070 08000000 00000000 00000000 00000000 ................

The names have just been relocated, that's all. Now viewing relocations:

(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -r ./output/x.o

Relocation section '.rela.text' at offset 0x960 contains 9 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
0000000000cd  000f00000004 R_X86_64_PLT32    0000000000000000 posix_memalign - 4
00000000020a  000e00000004 R_X86_64_PLT32    0000000000000000 free - 4
0000000002ed  000f00000004 R_X86_64_PLT32    0000000000000000 posix_memalign - 4
0000000003c1  000e00000004 R_X86_64_PLT32    0000000000000000 free - 4
000000000538  000f00000004 R_X86_64_PLT32    0000000000000000 posix_memalign - 4
0000000005c4  000e00000004 R_X86_64_PLT32    0000000000000000 free - 4
0000000005f6  000200000002 R_X86_64_PC32     0000000000000000 .LCPI7_0 - 4
0000000005ff  000d00000004 R_X86_64_PLT32    0000000000000000 expf - 4
000000000607  000300000002 R_X86_64_PC32     0000000000000004 .LCPI7_1 - 4

Relocation section '.rela.rodata' at offset 0xa38 contains 3 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000010  000c00000001 R_X86_64_64       0000000000000000 .rodata + 0
000000000030  000c00000001 R_X86_64_64       0000000000000000 .rodata + 9
000000000078  000c00000001 R_X86_64_64       0000000000000000 .rodata + 10

Source

ghost

All 15 comments

Hmm. Interesting.

First, the names in the SymbolTable inside BundleConfig do not seem to be consistent with those given in the ONNX graph.

Yes, the names used by glow are not 100% the same as the original ones. There are multiple reasons for that:

Glow converts all names into a format where they look like valid C identifiers. This is to make it possible to debug bundles using debuggers like GDB or LLDB.
Glow also performs a number of transformations and optimizations where it creates new nodes based on old ones, etc. As a result, there are often name conflicts and to avoid them Glow adds some suffixes to the names to make them unique.
That being said, it could be useful to have a way to get the original name from a symbol table.

Where are the other names?

What names do you expect to see? In general, the symbol table contains only the names of input/output tensors.

Could you actually show the result of -dump-ir on your model? It would give a better idea of names used by the model as Glow sees them.

opti-mix on 31 May 2019

What names do you expect to see? In general, the symbol table contains only the names of input/output tensors.

Yep, that is what I expect. However, when looking for the offset of those tensors, I run through the SymbolTable, grabbing all names in it and comparing to in and save_out. Once found, I grab the corresponding offset. BundleConfig is reporting 8 symbols in the symbol table. So I run over the 8 symbols, while, apparently, only 2 of those 8 contain "valid names."

So is there a better way to traverse the symbol table to identify the input and output tensor offsets? For instance, is it guaranteed that the output tensor is always at index 0 in the symbol table, and the input is always at index 1? Moreover, this method seems fragile also because, as you said, Glow can mangle original names.

ghost on 31 May 2019

Could you actually show the result of -dump-ir on your model? It would give a better idea of names used by the model as Glow sees them.

Yep, here it is:

declare {
  %A21 = WeightVar float<1 x 8> const // size: 32 // Users: @in 2
  %A41 = WeightVar float<1 x 8> const // size: 32 // Users: @in 7
  %A61 = WeightVar float<1 x 8> const // size: 32 // Users: @in 12
  %A81 = WeightVar float<1 x 1> const // size: 4 // Users: @in 16
  %A71 = WeightVar float<8 x 1> const // size: 32 // Users: @in 14
  %A11 = WeightVar float<2 x 8> const // size: 64 // Users: @in 1
  %A31 = WeightVar float<8 x 8> const // size: 256 // Users: @in 5
  %A51 = WeightVar float<8 x 8> const // size: 256 // Users: @in 10
  %in = WeightVar float<1 x 2> mutable // size: 8 // Users: @in 1
  %save_out = WeightVar float<1 x 1> mutable // size: 4 // Users: @in 16, @out 14, @in 17, @out 16, @out 17

  ; size = 720 bytes
}

code {
  0 %A92_res = allocactivation  { Ty: float<1 x 8>} // size: 32 // Users: @in 2, @out 1, @in 5, @out 3, @out 6, @in 3, @out 2
  1 %A91 = matmul @out %A92_res, @in %in, @in %A11
  2 %A92 = elementadd @out %A92_res, @in %A92_res, @in %A21
  3 %relu31 = cpumaxsplat @out %A92_res, @in %A92_res { SplatValue: 0.000000e+00}
  4 %A112_res = allocactivation  { Ty: float<1 x 8>} // size: 32 // Users: @in 7, @out 5, @in 10, @out 8, @out 11, @in 8, @out 7
  5 %A111 = matmul @out %A112_res, @in %A92_res, @in %A31
  6 %dealloc1 = deallocactivation @out %A92_res // size: 32
  7 %A112 = elementadd @out %A112_res, @in %A112_res, @in %A41
  8 %relu111 = cpumaxsplat @out %A112_res, @in %A112_res { SplatValue: 0.000000e+00}
  9 %A132_res = allocactivation  { Ty: float<1 x 8>} // size: 32 // Users: @in 12, @out 10, @in 14, @out 13, @out 15, @in 13, @out 12
  10 %A131 = matmul @out %A132_res, @in %A112_res, @in %A51
  11 %dealloc4 = deallocactivation @out %A112_res // size: 32
  12 %A132 = elementadd @out %A132_res, @in %A132_res, @in %A61
  13 %relu211 = cpumaxsplat @out %A132_res, @in %A132_res { SplatValue: 0.000000e+00}
  14 %A151 = matmul @out %save_out, @in %A132_res, @in %A71
  15 %dealloc7 = deallocactivation @out %A132_res // size: 32
  16 %A152 = elementadd @out %save_out, @in %save_out, @in %A81
  17 %out = sigmoid @out %save_out, @in %save_out
}

ghost on 31 May 2019

Ok! We can see in the ir-dump that only two WeightVars are marked as mutable, which is the indication that they are inputs or outputs. All others are constants, i.e. weights and thus their names are not saved in the symbol table currently (though we could if there is interest in this).

But I suspect that the number of entries reported in the symbol table may be wrong. Need to check a couple of things to confirm it.

opti-mix on 31 May 2019

@opti-mix Thanks for looking into this! I had that suspicion too, but wasn't completely sure (so edited the original post to remove "could BundleConfig::numSymbols be wrong?"). It is the only logical explanation for SIGSEGV that I could find so far.

ghost on 31 May 2019

From BundleSaver.cpp in void BundleSaver::emitBundleConfig():

  config->setInitializer(llvm::ConstantStruct::get(
      bundleConfigTy,
      llvm::ConstantInt::get(
          uint64TType, irgen_->getAllocationsInfo().constantWeightVarsMemSize_),
      llvm::ConstantInt::get(
          uint64TType, irgen_->getAllocationsInfo().mutableWeightVarsMemSize_),
      llvm::ConstantInt::get(uint64TType,
                             irgen_->getAllocationsInfo().activationsMemSize_),

      llvm::ConstantInt::get(uint64TType, TensorAlignment),
      llvm::ConstantInt::get(uint64TType, F_->findConstants().size()),

      symbolTable));

Should the line

llvm::ConstantInt::get(uint64TType, F_->findConstants().size())

instead be

llvm::ConstantInt::get(uint64TType, F_->findPlaceholders().size())

ghost on 31 May 2019

Yep, I can confirm that changing to findPlaceholders() works correctly.

ghost on 31 May 2019

👍1

Very good catch! findConstants seems to be obviously wrong to me.

But even findPlaceholders could be not precise enough. The thing is that placeholders are defined for the whole module, but of them may be not actually used in the code and no WeightVars are allocated for them. A symbol table probably should not contain any entries about such placeholders as they are useless.

opti-mix on 31 May 2019

Yep, I can confirm that changing to findPlaceholders() works correctly.

I guess now you are going to promise me a 3rd PR, or? :-D

opti-mix on 31 May 2019

A symbol table probably should not contain any entries about such placeholders as they are useless.

Agreed. Let's keep this issue open until the correct fix is applied. As of right now, I am just going to do this dirty fix on my end to keep things rolling here.

ghost on 31 May 2019

👍1

I guess now you are going to promise me a 3rd PR, or? :-D

Of course.

ghost on 31 May 2019

The thing is that placeholders are defined for the whole module, but of them may be not actually used in the code and no WeightVars are allocated for them. A symbol table probably should not contain any entries about such placeholders as they are useless.

CC: @gcatron -- Function::findPlaceholders() / Function::findConstants() actually does the right thing here, I believe. Probably could have been named better.

jfix71 on 31 May 2019

👍1

The thing is that placeholders are defined for the whole module, but of them may be not actually used in the code and no WeightVars are allocated for them. A symbol table probably should not contain any entries about such placeholders as they are useless.

CC: @gcatron -- Function::findPlaceholders() / Function::findConstants() actually does the right thing here, I believe. Probably could have been named better.

Function::findPlaceholders() should be right here. It only finds placeholders used by the function which should all be used.