Issue description
The Linux PyPI packages are, for some reason, much larger than the Windows and OS X packages.

This used to be fine, but recently it has caused the PyPI server to reject the Linux package. Now I'm getting
400 File too large. Limit for project 'taichi' is 60 MB. See https://pypi.org/help/#file-size-limit
when I try to upload the Linux package.
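A size check before uploading would catch this earlier. Below is a minimal Python sketch, not part of the Taichi release scripts. The helper name `oversized_files` is hypothetical, and reading the "60 MB" in the error message as 60 * 1024 * 1024 bytes is an assumption; adjust the constant if PyPI counts differently.

```python
import os
from typing import Iterable, List

# Assumption: "60 MB" from the PyPI error is interpreted as 60 * 1024 * 1024 bytes.
LIMIT_BYTES = 60 * 1024 * 1024


def oversized_files(paths: Iterable[str], limit: int = LIMIT_BYTES) -> List[str]:
    """Return the paths whose on-disk size exceeds `limit` bytes."""
    return [p for p in paths if os.path.getsize(p) > limit]
```

Calling `oversized_files` on the built wheels (for example, everything under `dist/`) and aborting the upload when the returned list is non-empty would surface the problem before PyPI rejects the file.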
Possible causes
This is probably because the libtaichi_core.so file is much larger than its counterparts on OS X (libtaichi_core.dylib) and Windows (taichi_core.dll). This might be caused by different linker behavior on the different platforms, or by something else.
Could someone smart and brave take a look at this? We need to have this resolved to release v0.6.5.
Meanwhile, I have submitted a request to PyPI for a larger size. Not sure how soon they will respond though: https://github.com/pypa/pypi-support/issues/398
Temporary solution given the urgent label:
- Optimize at link time with ld, i.e. remove unneeded functions like metal_xxx?
- Use -Os for clang to optimize for size, at the cost of some performance compared to -O3.
Btw, can you do some ELF dumping to figure out why the Linux package is so much larger than the Win/OS X ones?
I think this is because the Linux .so includes the debug symbols. I don't have the dev env on my Linux box to build 0.6.5, so I'm using the released 0.6.4 as an example.
If you print out the size of each section:
$ size -A -d taichi_core.so
taichi_core.so :
section size addr
.gnu.hash 223692 456
.dynsym 768600 224152
.dynstr 2115769 992752
.gnu.version 64050 3108522
.gnu.version_r 672 3172576
.rela.dyn 2201184 3173248
.rela.plt 324720 5374432
.init 23 5699152
.plt 216496 5699184
.plt.got 456 5915680
.text 27982847 5916144
.fini 9 33898992
.rodata 11013628 33899008
.eh_frame_hdr 334316 44912636
.eh_frame 2410540 45246952
.gcc_except_table 1291908 47657492
.tbss 32 51047648
.init_array 2888 51047648
.fini_array 8 51050536
.data.rel.ro 2021760 51050544
.dynamic 608 53072304
.got 48104 53072912
.got.plt 108264 53121024
.data 31440 53229296
.bss 385440 53260736
.comment 104 0
.debug_pubnames 18838258 0
.debug_info 69067416 0
.debug_abbrev 358438 0
.debug_line 13208507 0
.debug_str 13328657 0
.debug_loc 73602710 0
.debug_macinfo 136 0
.debug_pubtypes 8251971 0
.debug_ranges 13726288 0
Total 261929939
The debug sections actually take 18838258 + 69067416 + 358438 + 13208507 + 13328657 + 73602710 + 136 + 8251971 + 13726288 = 210382381 bytes ~= 210 MB, while the total size is 262 MB.
So the release build also includes the debug symbols. We probably pass -g somewhere in the CMake rules? https://github.com/taichi-dev/taichi/blob/0c1bb390ef79008f6c7c44a039665622ef546bba/cmake/TaichiCXXFlags.cmake#L69
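For anyone who wants to repeat this tally on another build, here is a minimal Python sketch that sums the `.debug*` rows of `size -A -d` output. The function name `summarize_sections` is hypothetical, and the sample is only a three-row excerpt of the listing above, so its totals are not the full-file totals.

```python
from typing import Tuple


def summarize_sections(size_output: str) -> Tuple[int, int]:
    """Return (debug_bytes, all_bytes) from `size -A -d` text output."""
    debug_bytes = 0
    all_bytes = 0
    for line in size_output.splitlines():
        fields = line.split()
        # Section rows look like ".debug_info 69067416 0"; skip headers and "Total".
        if len(fields) != 3 or not fields[0].startswith("."):
            continue
        size = int(fields[1])
        all_bytes += size
        if fields[0].startswith(".debug"):
            debug_bytes += size
    return debug_bytes, all_bytes


# Three rows copied from the listing above (an excerpt, not the whole file).
sample = """\
taichi_core.so :
section size addr
.text 27982847 5916144
.debug_info 69067416 0
.debug_loc 73602710 0
"""
debug_bytes, all_bytes = summarize_sections(sample)
print(f"debug: {debug_bytes} of {all_bytes} bytes")
```

To run it against a real library, the text can come from `subprocess.run(["size", "-A", "-d", path], capture_output=True, text=True, check=True).stdout`, assuming binutils is installed.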
You can strip away the debug symbols by using the strip command, e.g.
$ strip taichi_core.so
$ size -A -d taichi_core.so
taichi_core.so :
section size addr
.gnu.hash 223692 456
.dynsym 768600 224152
.dynstr 2115769 992752
.gnu.version 64050 3108522
.gnu.version_r 672 3172576
.rela.dyn 2201184 3173248
.rela.plt 324720 5374432
.init 23 5699152
.plt 216496 5699184
.plt.got 456 5915680
.text 27982847 5916144
.fini 9 33898992
.rodata 11013628 33899008
.eh_frame_hdr 334316 44912636
.eh_frame 2410540 45246952
.gcc_except_table 1291908 47657492
.tbss 32 51047648
.init_array 2888 51047648
.fini_array 8 51050536
.data.rel.ro 2021760 51050544
.dynamic 608 53072304
.got 48104 53072912
.got.plt 108264 53121024
.data 31440 53229296
.bss 385440 53260736
.comment 104 0
Total 51547558
All the debug info is gone, and the size is reduced to 51.5 MB.
As a comparison, on the Mac platform, we don't have such debug info to begin with:
$ objdump -section-headers taichi_core.so
taichi_core.so: file format Mach-O 64-bit x86-64
Sections:
Idx Name Size Address Type
0 __text 018a1a3a 0000000000003fd0 TEXT
1 __stubs 000013da 00000000018a5a0a TEXT
2 __stub_helper 00001354 00000000018a6de4 TEXT
3 __gcc_except_tab 0002f03c 00000000018a8138 DATA
4 __const 002790d0 00000000018d7180 DATA
5 __cstring 00184c26 0000000001b50250 DATA
6 __ustring 0000001c 0000000001cd4e76 DATA
7 __unwind_info 0001e0c8 0000000001cd4e94 DATA
8 __eh_frame 000060a0 0000000001cf2f60 DATA
9 __nl_symbol_ptr 00000010 0000000001cf9000 DATA
10 __got 00002cc0 0000000001cf9010 DATA
11 __la_symbol_ptr 00001a78 0000000001cfbcd0 DATA
12 __mod_init_func 00000838 0000000001cfd748 DATA
13 __const 001af2a0 0000000001cfdf80 DATA
14 __data 000056c0 0000000001ead220 DATA
15 __thread_vars 00000060 0000000001eb28e0 DATA
16 __thread_ptrs 00000018 0000000001eb2940 DATA
17 __thread_bss 00000020 0000000001eb2958 DATA
18 __common 00027c98 0000000001eb2980 BSS
19 __bss 00027198 0000000001eda620 BSS
Its total size is 39MB.
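The sizes in the Mach-O listing are hexadecimal, so they are not directly comparable with the decimal `size -A -d` figures from Linux. A minimal Python sketch for converting them follows; the function name `macho_section_sizes` is hypothetical and the sample holds only the first two rows of the listing. Note that the sum of the listed sections need not match the on-disk file size, since the file contains more than these sections.

```python
from typing import List, Tuple


def macho_section_sizes(objdump_output: str) -> List[Tuple[str, int]]:
    """Return (section name, size in bytes) pairs from objdump section headers."""
    sections = []
    for line in objdump_output.splitlines():
        fields = line.split()
        # Section rows look like "0 __text 018a1a3a 0000000000003fd0 TEXT".
        if len(fields) == 5 and fields[0].isdigit():
            sections.append((fields[1], int(fields[2], 16)))
    return sections


# First two rows of the listing above (an excerpt, not the whole file).
sample = """\
Idx Name Size Address Type
0 __text 018a1a3a 0000000000003fd0 TEXT
1 __stubs 000013da 00000000018a5a0a TEXT
"""
sizes = macho_section_sizes(sample)
total = sum(size for _, size in sizes)
print(sizes, total)
```

A list of pairs is used rather than a dict because section names can repeat (the listing has two `__const` rows).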
Lastly, I don't think #529 or any other platform-dependent code is the culprit for this kind of issue, unless we know it occupies a significant part of the codebase.
For Metal, the runtime part/APIs are already conditionally enabled only on the Mac platform. The things you are seeing are only the codegen part, but we have codegens for LLVM/Metal/OpenGL. Assuming each one is of the same complexity, and contributes roughly the same amount of data to the final .so, I don't see why disabling one particular codegen could significantly reduce the total size.
Another thing: we could consider releasing the debug info as well, just not embedded in the shared lib. That way, users who really want to debug Taichi can at least do so without being completely blind. E.g. https://stackoverflow.com/questions/866721/how-to-generate-gcc-debug-symbol-outside-the-build-target
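One way to do that split is with objcopy from GNU binutils. This is a minimal Python sketch under stated assumptions, not what the Taichi build actually does: the helper names and the `.debug` file suffix are hypothetical, while `--only-keep-debug`, `--strip-debug` and `--add-gnu-debuglink` are standard objcopy options.

```python
import subprocess
from typing import List


def split_debug_commands(lib: str) -> List[List[str]]:
    """Return the objcopy invocations that move debug info out of `lib`."""
    debug_file = lib + ".debug"  # hypothetical naming convention
    return [
        # 1. Copy only the debug sections into a side file.
        ["objcopy", "--only-keep-debug", lib, debug_file],
        # 2. Remove the debug sections from the library that gets shipped.
        ["objcopy", "--strip-debug", lib],
        # 3. Record the side file's name in the library so a debugger can find it.
        ["objcopy", f"--add-gnu-debuglink={debug_file}", lib],
    ]


def split_debug(lib: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in split_debug_commands(lib):
        subprocess.run(cmd, check=True)
```

Calling `split_debug("taichi_core.so")` would leave a smaller `taichi_core.so` for the wheel and a `taichi_core.so.debug` file that could be published separately.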
Thank you for the valuable discussions! Following the suggestions by @k-ye, I did a benchmark (Default cmake build type: RelWithDebInfo):
- Release instead of RelWithDebInfo, with -g removed: 53 MB
- Release instead of RelWithDebInfo, without removing -g: 258 MB
(-g refers to https://github.com/taichi-dev/taichi/blob/0c1bb390ef79008f6c7c44a039665622ef546bba/cmake/TaichiCXXFlags.cmake#L69)
So we have to do both: remove -g and use Release instead of RelWithDebInfo. As I guessed, RelWithDebInfo implicitly adds -g.
> So we have to do both: remove -g and use Release instead of RelWithDebInfo. As I guessed, RelWithDebInfo implicitly adds -g.
BTW, maybe we should have two build modes: Release vs Debug (Dev)? Otherwise I guess the printed stack would be complete trash when we are developing. https://stackoverflow.com/a/7725055/12003165
> Another thing: we could consider releasing the debug info as well, just not embedded in the shared lib. That way, users who really want to debug Taichi can at least do so without being completely blind. E.g. https://stackoverflow.com/questions/866721/how-to-generate-gcc-debug-symbol-outside-the-build-target
The interesting thing is that even without debug info the stack backtrace system still works (I guess global function names are not really part of the debug info):
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-ndcffy4r
[Taichi] sandbox prepared
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 660e9738, python 3.6.9
[E 05/19/20 09:55:29.695] [codegen_cuda.cpp:visit@398] test
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVMCUDA::visit(taichi::lang::OffloadedStmt*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVM::visit(taichi::lang::Block*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVM::emit_to_module()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVM::gen()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenCUDA::codegen()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::KernelCodeGen::compile()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Program::compile(taichi::lang::Kernel&)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::compile()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::operator()()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::ConstantFold::visit(taichi::lang::UnaryOpStmt*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::BasicStmtVisitor::visit(taichi::lang::Block*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::BasicStmtVisitor::visit(taichi::lang::Block*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::irpass::constant_fold(taichi::lang::IRNode*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::irpass::full_simplify(taichi::lang::IRNode*, taichi::lang::CompileConfig const&, taichi::lang::Kernel*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::irpass::compile_to_offloads(taichi::lang::IRNode*, taichi::lang::CompileConfig const&, bool, bool, bool, bool, bool)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::lower(bool)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Program::compile(taichi::lang::Kernel&)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::compile()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::operator()()
/tmp/taichi-ndcffy4r/taichi_core.so(+0x707c34) [0x7f668b887c34]
/tmp/taichi-ndcffy4r/taichi_core.so(+0x664137) [0x7f668b7e4137]
python3.6(_PyCFunction_FastCallDict+0x154) [0x560e3934bc54]
python3.6(_PyObject_FastCallDict+0x2bf) [0x560e3934c06f]
python3.6(_PyObject_Call_Prepend+0x63) [0x560e39350aa3]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(+0x16b371) [0x560e393a5371]
python3.6(_PyObject_FastCallDict+0x8b) [0x560e3934be3b]
python3.6(+0x199c0e) [0x560e393d3c0e]
python3.6(_PyEval_EvalFrameDefault+0x30a) [0x560e393f675a]
python3.6(PyEval_EvalCodeEx+0x966) [0x560e393ceff6]
python3.6(+0x1957d4) [0x560e393cf7d4]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(_PyEval_EvalFrameDefault+0x19e7) [0x560e393f7e37]
python3.6(+0x192e66) [0x560e393cce66]
python3.6(_PyFunction_FastCallDict+0x3d8) [0x560e393ce598]
python3.6(_PyObject_FastCallDict+0x26f) [0x560e3934c01f]
python3.6(_PyObject_Call_Prepend+0x63) [0x560e39350aa3]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(+0x16b371) [0x560e393a5371]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(_PyEval_EvalFrameDefault+0x19e7) [0x560e393f7e37]
python3.6(+0x193136) [0x560e393cd136]
python3.6(+0x193ed6) [0x560e393cded6]
python3.6(+0x199b95) [0x560e393d3b95]
python3.6(_PyEval_EvalFrameDefault+0x30a) [0x560e393f675a]
python3.6(PyEval_EvalCodeEx+0x329) [0x560e393ce9b9]
python3.6(PyEval_EvalCode+0x1c) [0x560e393cf75c]
python3.6(+0x215744) [0x560e3944f744]
python3.6(PyRun_FileExFlags+0xa1) [0x560e3944fb41]
python3.6(PyRun_SimpleFileExFlags+0x1c3) [0x560e3944fd43]
python3.6(Py_Main+0x613) [0x560e39453833]
python3.6(main+0xee) [0x560e3931d88e]
/lib/x86_64-linux-gnu/libc.so.6: __libc_start_main
python3.6(+0x1c3160) [0x560e393fd160]
I guess this is enough for Linux users to report what's happening.
> So we have to do both: remove -g and use Release instead of RelWithDebInfo. As I guessed, RelWithDebInfo implicitly adds -g.
> BTW, maybe we should have two build modes: Release vs Debug (Dev)? Otherwise I guess the printed stack would be complete trash when we are developing. https://stackoverflow.com/a/7725055/12003165
Right, I was worried about that. After the experiment above I think we can safely release without debug info, since we never use any information beyond the global function names. I'll go ahead and fix the Linux PyPI release. Thanks again for pointing out what's happening here!