Running on Ubuntu 18.04 with:
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/examples/bundles/x$ clang --version
clang version 7.0.0-3~ubuntu0.18.04.1 (tags/RELEASE_700/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
There are two issues here. First, the names in the SymbolTable inside BundleConfig do not seem to be consistent with those given in the ONNX graph. For instance, consider this graph:
graph(%in : Float(1, 2)
%1 : Float(8, 2)
%2 : Float(8)
%3 : Float(8, 8)
%4 : Float(8)
%5 : Float(8, 8)
%6 : Float(8)
%7 : Float(1, 8)
%8 : Float(1)) {
%9 : Float(1, 8) = onnx::Gemm[alpha=1, beta=1, transB=1](%in, %1, %2), scope: FeedForwardNN/Linear
%10 : Float(1, 8) = onnx::Relu(%9), scope: FeedForwardNN/ReLU
%11 : Float(1, 8) = onnx::Gemm[alpha=1, beta=1, transB=1](%10, %3, %4), scope: FeedForwardNN/ReLU
%12 : Float(1, 8) = onnx::Relu(%11), scope: FeedForwardNN/ReLU
%13 : Float(1, 8) = onnx::Gemm[alpha=1, beta=1, transB=1](%12, %5, %6), scope: FeedForwardNN/ReLU
%14 : Float(1, 8) = onnx::Relu(%13), scope: FeedForwardNN/ReLU
%15 : Float(1, 1) = onnx::Gemm[alpha=1, beta=1, transB=1](%14, %7, %8), scope: FeedForwardNN/ReLU
%out : Float(1, 1) = onnx::Sigmoid(%15), scope: FeedForwardNN/Sigmoid
return (%out);
}
generated by
torch.onnx.export(ffnn, inp, "ffnn.onnx", input_names="in",
output_names="out", verbose=True, export_params=True)
Now viewing the symbols inside the generated bundle (with relocation-model=pic):
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -p 5 ./output/x.o
String dump of section '.rodata':
[ 0] save_out
[ 9] in
In other words, Glow seems to prepend the output tensor name with save_. I don't think this is documented anywhere.
Now the real issue. Consider the following innocuous code that simply returns the (bounded) string length of the names from the symbol table:
#include <stdio.h>
#include <stdint.h>
struct SymbolTableEntry {
const char *name;
uint64_t offset;
uint64_t size;
char kind;
};
struct BundleConfig {
uint64_t cwv_size;
uint64_t mwv_size;
uint64_t a_size;
uint64_t alignment;
uint64_t nsymbols;
const struct SymbolTableEntry *symbols;
};
extern struct BundleConfig x_config;
int main(int argc, char **argv)
{
printf("Total number of symbols: %d\n", x_config.nsymbols);
for (int jj = 0; jj < x_config.nsymbols; ++jj) {
if (x_config.symbols[jj].name != NULL)
printf("strnlen(sym[%d].name, 8) == %d\n", jj, strnlen(x_config.symbols[jj].name, 8));
}
return 0;
}
Compiling and running the code gives:
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/examples/bundles/x$ ./ttt
Total number of symbols: 8
strnlen(sym[0].name, 8) == 8
strnlen(sym[1].name, 8) == 2
Segmentation fault (core dumped)
Let's look at those symbols inside x.o again:
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -s ./output/x.o
Symbol table '.symtab' contains 19 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS llvm-link
2: 0000000000000000 0 NOTYPE LOCAL DEFAULT 4 .LCPI7_0
3: 0000000000000004 0 NOTYPE LOCAL DEFAULT 4 .LCPI7_1
4: 00000000000000a0 377 FUNC LOCAL DEFAULT 2 libjit_matmul_f_0_special
5: 00000000000002c0 272 FUNC LOCAL DEFAULT 2 libjit_matmul_f_2_special
6: 0000000000000510 195 FUNC LOCAL DEFAULT 2 libjit_matmul_f_5_special
7: 00000000000003d0 159 FUNC LOCAL DEFAULT 2 libjit_stacked_kernel.1_3
8: 0000000000000470 159 FUNC LOCAL DEFAULT 2 libjit_stacked_kernel.2_4
9: 00000000000005e0 62 FUNC LOCAL DEFAULT 2 libjit_stacked_kernel.3_6
10: 0000000000000220 159 FUNC LOCAL DEFAULT 2 libjit_stacked_kernel_1_s
11: 0000000000000000 64 OBJECT LOCAL DEFAULT 6 xSymbolTable
12: 0000000000000000 0 SECTION LOCAL DEFAULT 5
13: 0000000000000000 0 SECTION LOCAL DEFAULT 6
14: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND expf
15: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND free
16: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND posix_memalign
17: 0000000000000000 158 FUNC GLOBAL DEFAULT 2 x
18: 0000000000000040 48 OBJECT GLOBAL DEFAULT 6 x_config
Ok, now dump .rodata:
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -x .rodata ./output/x.o
Hex dump of section '.rodata':
0x00000000 73617665 5f6f7574 00696e00 save_out.in.
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$
We only see save_out and in. Where are the other names? Compiling with -relocation-model=static gives:
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -x .rodata ./output/x.o
Hex dump of section '.rodata':
NOTE: This section has relocations against it, but these have NOT been applied to this dump.
0x00000000 73617665 5f6f7574 00696e00 00000000 save_out.in.....
0x00000010 00000000 00000000 00000000 00000000 ................
0x00000020 01000000 00000000 01000000 00000000 ................
0x00000030 00000000 00000000 40000000 00000000 ........@.......
0x00000040 02000000 00000000 01000000 00000000 ................
0x00000050 80030000 00000000 80000000 00000000 ................
0x00000060 80000000 00000000 40000000 00000000 ........@.......
0x00000070 08000000 00000000 00000000 00000000 ................
The names have just been relocated, that's all. Now viewing relocations:
(base) vagrant@ubuntu-18:~/dts/nn/xgmr/xperi-glow/build/debug$ readelf -r ./output/x.o
Relocation section '.rela.text' at offset 0x960 contains 9 entries:
Offset Info Type Sym. Value Sym. Name + Addend
0000000000cd 000f00000004 R_X86_64_PLT32 0000000000000000 posix_memalign - 4
00000000020a 000e00000004 R_X86_64_PLT32 0000000000000000 free - 4
0000000002ed 000f00000004 R_X86_64_PLT32 0000000000000000 posix_memalign - 4
0000000003c1 000e00000004 R_X86_64_PLT32 0000000000000000 free - 4
000000000538 000f00000004 R_X86_64_PLT32 0000000000000000 posix_memalign - 4
0000000005c4 000e00000004 R_X86_64_PLT32 0000000000000000 free - 4
0000000005f6 000200000002 R_X86_64_PC32 0000000000000000 .LCPI7_0 - 4
0000000005ff 000d00000004 R_X86_64_PLT32 0000000000000000 expf - 4
000000000607 000300000002 R_X86_64_PC32 0000000000000004 .LCPI7_1 - 4
Relocation section '.rela.rodata' at offset 0xa38 contains 3 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000010 000c00000001 R_X86_64_64 0000000000000000 .rodata + 0
000000000030 000c00000001 R_X86_64_64 0000000000000000 .rodata + 9
000000000078 000c00000001 R_X86_64_64 0000000000000000 .rodata + 10
Hmm. Interesting.
First, the names in the SymbolTable inside BundleConfig do not seem to be consistent with those given in the ONNX graph.
Yes, the names used by glow are not 100% the same as the original ones. There are multiple reasons for that:
Where are the other names?
What names do you expect to see? In general, the symbol table contains only the names of input/output tensors.
Could you actually show the result of -dump-ir on your model? It would give a better idea of names used by the model as Glow sees them.
What names do you expect to see? In general, the symbol table contains only the names of input/output tensors.
Yep, that is what I expect. However, when looking for the offset of those tensors, I run through the SymbolTable, grabbing all names in it and comparing to in and save_out. Once found, I grab the corresponding offset. BundleConfig is reporting 8 symbols in the symbol table. So I run over the 8 symbols, while, apparently, only 2 of those 8 contain "valid names."
So is there a better way to traverse the symbol table to identify the input and output tensor offsets? For instance, is it guaranteed that the output tensor is always at index 0 in the symbol table, and the input is always at index 1? Moreover, this method seems fragile also because, as you said, Glow can mangle original names.
Could you actually show the result of -dump-ir on your model? It would give a better idea of names used by the model as Glow sees them.
Yep, here it is:
declare {
%A21 = WeightVar float<1 x 8> const // size: 32 // Users: @in 2
%A41 = WeightVar float<1 x 8> const // size: 32 // Users: @in 7
%A61 = WeightVar float<1 x 8> const // size: 32 // Users: @in 12
%A81 = WeightVar float<1 x 1> const // size: 4 // Users: @in 16
%A71 = WeightVar float<8 x 1> const // size: 32 // Users: @in 14
%A11 = WeightVar float<2 x 8> const // size: 64 // Users: @in 1
%A31 = WeightVar float<8 x 8> const // size: 256 // Users: @in 5
%A51 = WeightVar float<8 x 8> const // size: 256 // Users: @in 10
%in = WeightVar float<1 x 2> mutable // size: 8 // Users: @in 1
%save_out = WeightVar float<1 x 1> mutable // size: 4 // Users: @in 16, @out 14, @in 17, @out 16, @out 17
; size = 720 bytes
}
code {
0 %A92_res = allocactivation { Ty: float<1 x 8>} // size: 32 // Users: @in 2, @out 1, @in 5, @out 3, @out 6, @in 3, @out 2
1 %A91 = matmul @out %A92_res, @in %in, @in %A11
2 %A92 = elementadd @out %A92_res, @in %A92_res, @in %A21
3 %relu31 = cpumaxsplat @out %A92_res, @in %A92_res { SplatValue: 0.000000e+00}
4 %A112_res = allocactivation { Ty: float<1 x 8>} // size: 32 // Users: @in 7, @out 5, @in 10, @out 8, @out 11, @in 8, @out 7
5 %A111 = matmul @out %A112_res, @in %A92_res, @in %A31
6 %dealloc1 = deallocactivation @out %A92_res // size: 32
7 %A112 = elementadd @out %A112_res, @in %A112_res, @in %A41
8 %relu111 = cpumaxsplat @out %A112_res, @in %A112_res { SplatValue: 0.000000e+00}
9 %A132_res = allocactivation { Ty: float<1 x 8>} // size: 32 // Users: @in 12, @out 10, @in 14, @out 13, @out 15, @in 13, @out 12
10 %A131 = matmul @out %A132_res, @in %A112_res, @in %A51
11 %dealloc4 = deallocactivation @out %A112_res // size: 32
12 %A132 = elementadd @out %A132_res, @in %A132_res, @in %A61
13 %relu211 = cpumaxsplat @out %A132_res, @in %A132_res { SplatValue: 0.000000e+00}
14 %A151 = matmul @out %save_out, @in %A132_res, @in %A71
15 %dealloc7 = deallocactivation @out %A132_res // size: 32
16 %A152 = elementadd @out %save_out, @in %save_out, @in %A81
17 %out = sigmoid @out %save_out, @in %save_out
}
Ok! We can see in the ir-dump that only two WeightVars are marked as mutable, which is the indication that they are inputs or outputs. All others are constants, i.e. weights and thus their names are not saved in the symbol table currently (though we could if there is interest in this).
But I suspect that the number of entries reported in the symbol table may be wrong. Need to check a couple of things to confirm it.
@opti-mix Thanks for looking into this! I had that suspicion too, but wasn't completely sure (so edited the original post to remove "could BundleConfig::numSymbols be wrong?"). It is the only logical explanation for SIGSEGV that I could find so far.
From BundleSaver.cpp in void BundleSaver::emitBundleConfig():
config->setInitializer(llvm::ConstantStruct::get(
bundleConfigTy,
llvm::ConstantInt::get(
uint64TType, irgen_->getAllocationsInfo().constantWeightVarsMemSize_),
llvm::ConstantInt::get(
uint64TType, irgen_->getAllocationsInfo().mutableWeightVarsMemSize_),
llvm::ConstantInt::get(uint64TType,
irgen_->getAllocationsInfo().activationsMemSize_),
llvm::ConstantInt::get(uint64TType, TensorAlignment),
llvm::ConstantInt::get(uint64TType, F_->findConstants().size()),
symbolTable));
Should the line
llvm::ConstantInt::get(uint64TType, F_->findConstants().size())
instead be
llvm::ConstantInt::get(uint64TType, F_->findPlaceholders().size())
Yep, I can confirm that changing to findPlaceholders() works correctly.
Very good catch! findConstants seems to be obviously wrong to me.
But even findPlaceholders could be not precise enough. The thing is that placeholders are defined for the whole module, but of them may be not actually used in the code and no WeightVars are allocated for them. A symbol table probably should not contain any entries about such placeholders as they are useless.
Yep, I can confirm that changing to findPlaceholders() works correctly.
I guess now you are going to promise me a 3rd PR, or? :-D
A symbol table probably should not contain any entries about such placeholders as they are useless.
Agreed. Let's keep this issue open until the correct fix is applied. As of right now, I am just going to do this dirty fix on my end to keep things rolling here.
I guess now you are going to promise me a 3rd PR, or? :-D
Of course.
The thing is that placeholders are defined for the whole module, but of them may be not actually used in the code and no WeightVars are allocated for them. A symbol table probably should not contain any entries about such placeholders as they are useless.
CC: @gcatron -- Function::findPlaceholders() / Function::findConstants() actually does the right thing here, I believe. Probably could have been named better.
The thing is that placeholders are defined for the whole module, but of them may be not actually used in the code and no WeightVars are allocated for them. A symbol table probably should not contain any entries about such placeholders as they are useless.
CC: @gcatron --
Function::findPlaceholders()/Function::findConstants()actually does the right thing here, I believe. Probably could have been named better.
Function::findPlaceholders() should be right here. It only finds placeholders used by the function which should all be used.
Heh, actually I found this yesterday and was going to put up that fix today.
@nickgg
Heh, actually I found this yesterday and was going to put up that fix today.
Please go ahead, don't wait on me. Also feel free to close this issue once the fix is in.