Taichi: [PyPI] Linux package much bigger than Windows and OS X

Created on 19 May 2020  ·  9 Comments  ·  Source: taichi-dev/taichi

Issue description

The Linux PyPI packages are, for some reason, much larger than the Windows and OS X packages.

This used to be fine, but recently the PyPI server has started rejecting the Linux package. Now I get

400 File too large. Limit for project 'taichi' is 60 MB. See https://pypi.org/help/#file-size-limit

when I try to upload the Linux package.
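A quick pre-upload check could catch this before PyPI does. This is only a sketch: `fits_limit` is a made-up helper name, and treating "60 MB" as 60 * 1024 * 1024 bytes is an assumption about how PyPI counts the limit.

```shell
# fits_limit FILE [LIMIT_BYTES]
# Succeeds if FILE is no larger than LIMIT_BYTES.
# Hypothetical helper; the default of 60 * 1024 * 1024 bytes is an assumption.
fits_limit() {
  limit=${2:-$((60 * 1024 * 1024))}
  # $(( ... )) strips the padding some wc implementations add
  bytes=$(($(wc -c < "$1")))
  [ "$bytes" -le "$limit" ]
}

# Usage before uploading:
#   for whl in dist/*.whl; do
#     fits_limit "$whl" || echo "too large: $whl"
#   done
```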

Possible causes

This is probably because the libtaichi_core.so file is much larger than its counterparts on OS X (libtaichi_core.dylib) and Windows (taichi_core.dll). This might be caused by different linker behavior or some other difference between the platforms.

Could someone smart and brave take a look at this? We need to have this resolved to release v0.6.5.

Labels: help wanted, linux, urgent, welcome contribution

Most helpful comment

I think this is because the Linux .so includes debug symbols. I don't have the dev env on my Linux box to build 0.6.5, so I'm using the released 0.6.4 as an example.

If you print out the size of each section:

$ size  -A -d taichi_core.so 

taichi_core.so  :
section                  size       addr
.gnu.hash              223692        456
.dynsym                768600     224152
.dynstr               2115769     992752
.gnu.version            64050    3108522
.gnu.version_r            672    3172576
.rela.dyn             2201184    3173248
.rela.plt              324720    5374432
.init                      23    5699152
.plt                   216496    5699184
.plt.got                  456    5915680
.text                27982847    5916144
.fini                       9   33898992
.rodata              11013628   33899008
.eh_frame_hdr          334316   44912636
.eh_frame             2410540   45246952
.gcc_except_table     1291908   47657492
.tbss                      32   51047648
.init_array              2888   51047648
.fini_array                 8   51050536
.data.rel.ro          2021760   51050544
.dynamic                  608   53072304
.got                    48104   53072912
.got.plt               108264   53121024
.data                   31440   53229296
.bss                   385440   53260736
.comment                  104          0
.debug_pubnames      18838258          0
.debug_info          69067416          0
.debug_abbrev          358438          0
.debug_line          13208507          0
.debug_str           13328657          0
.debug_loc           73602710          0
.debug_macinfo            136          0
.debug_pubtypes       8251971          0
.debug_ranges        13726288          0
Total               261929939

The debug sections actually take 18838258 + 69067416 + 358438 + 13208507 + 13328657 + 73602710 + 136 + 8251971 + 13726288 = 210382381 bytes ~= 210MB, while the total size is 262MB.
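The sum does not have to be done by hand. Here is a minimal sketch that adds up every `.debug_*` row of `size -A -d` output; `sum_debug_sections` is a hypothetical helper name, not part of binutils, and the rows fed to it are copied from the listing above.

```shell
# Sum the size column of all .debug_* rows read from stdin.
# sum_debug_sections is a hypothetical helper, not a binutils command.
sum_debug_sections() {
  awk '$1 ~ /^\.debug_/ { total += $2 } END { printf "%d\n", total }'
}

# With a real library (requires binutils):
#   size -A -d taichi_core.so | sum_debug_sections

# Check against the numbers quoted above.
sum_debug_sections <<'EOF'
.comment                  104          0
.debug_pubnames      18838258          0
.debug_info          69067416          0
.debug_abbrev          358438          0
.debug_line          13208507          0
.debug_str           13328657          0
.debug_loc           73602710          0
.debug_macinfo            136          0
.debug_pubtypes       8251971          0
.debug_ranges        13726288          0
EOF
# prints 210382381
```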

So the release build also includes the debug symbols. Perhaps we put -g somewhere in the CMake rules?

https://github.com/taichi-dev/taichi/blob/0c1bb390ef79008f6c7c44a039665622ef546bba/cmake/TaichiCXXFlags.cmake#L69

You can strip away the debug symbols by using the strip command, e.g.

$ strip taichi_core.so 
$ size -A -d taichi_core.so

taichi_core.so  :
section                 size       addr
.gnu.hash             223692        456
.dynsym               768600     224152
.dynstr              2115769     992752
.gnu.version           64050    3108522
.gnu.version_r           672    3172576
.rela.dyn            2201184    3173248
.rela.plt             324720    5374432
.init                     23    5699152
.plt                  216496    5699184
.plt.got                 456    5915680
.text               27982847    5916144
.fini                      9   33898992
.rodata             11013628   33899008
.eh_frame_hdr         334316   44912636
.eh_frame            2410540   45246952
.gcc_except_table    1291908   47657492
.tbss                     32   51047648
.init_array             2888   51047648
.fini_array                8   51050536
.data.rel.ro         2021760   51050544
.dynamic                 608   53072304
.got                   48104   53072912
.got.plt              108264   53121024
.data                  31440   53229296
.bss                  385440   53260736
.comment                 104          0
Total               51547558

All the debug info is gone, and the size is reduced to 51.5MB.
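To see exactly which sections strip dropped, the two listings can be diffed. A rough sketch, assuming both `size -A -d` outputs were saved to files first; `removed_sections` is a hypothetical helper name. Note that in the listings above `.dynsym` and `.dynstr` are present both before and after.

```shell
# removed_sections BEFORE AFTER
# Print section names that appear in BEFORE but not in AFTER, where both
# files hold saved `size -A -d` output. Hypothetical helper for illustration.
removed_sections() {
  awk 'NR == FNR { kept[$1] = 1; next }              # first file read: AFTER
       $1 ~ /^\./ && !($1 in kept) { print $1 }' "$2" "$1"
}

# Usage (requires binutils):
#   size -A -d taichi_core.so > before.txt
#   strip taichi_core.so
#   size -A -d taichi_core.so > after.txt
#   removed_sections before.txt after.txt
```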

As a comparison, on the Mac platform, we don't have such debug info to begin with:

objdump -section-headers taichi_core.so 

taichi_core.so: file format Mach-O 64-bit x86-64

Sections:
Idx Name          Size      Address          Type
  0 __text        018a1a3a 0000000000003fd0 TEXT 
  1 __stubs       000013da 00000000018a5a0a TEXT 
  2 __stub_helper 00001354 00000000018a6de4 TEXT 
  3 __gcc_except_tab 0002f03c 00000000018a8138 DATA 
  4 __const       002790d0 00000000018d7180 DATA 
  5 __cstring     00184c26 0000000001b50250 DATA 
  6 __ustring     0000001c 0000000001cd4e76 DATA 
  7 __unwind_info 0001e0c8 0000000001cd4e94 DATA 
  8 __eh_frame    000060a0 0000000001cf2f60 DATA 
  9 __nl_symbol_ptr 00000010 0000000001cf9000 DATA 
 10 __got         00002cc0 0000000001cf9010 DATA 
 11 __la_symbol_ptr 00001a78 0000000001cfbcd0 DATA 
 12 __mod_init_func 00000838 0000000001cfd748 DATA 
 13 __const       001af2a0 0000000001cfdf80 DATA 
 14 __data        000056c0 0000000001ead220 DATA 
 15 __thread_vars 00000060 0000000001eb28e0 DATA 
 16 __thread_ptrs 00000018 0000000001eb2940 DATA 
 17 __thread_bss  00000020 0000000001eb2958 DATA 
 18 __common      00027c98 0000000001eb2980 BSS
 19 __bss         00027198 0000000001eda620 BSS

Its total size is 39MB.
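The Mach-O listing prints sizes in hex, so it is not directly comparable with the decimal `size -A -d` output. A small sketch that converts and sums the Size column; `sum_hex_sizes` is a hypothetical helper name, and it assumes each section row starts with a numeric index followed by name and size, as in the listing above.

```shell
# Sum the hex Size column of `objdump -section-headers` rows read from stdin.
# Hypothetical helper; assumes rows look like "  0 __text 018a1a3a <addr> TEXT".
sum_hex_sizes() {
  total=0
  while read -r idx name size rest; do
    case $idx in
      ''|*[!0-9]*) continue ;;   # skip the header and any non-section lines
    esac
    total=$((total + 0x$size))   # shell arithmetic accepts 0x-prefixed hex
  done
  echo "$total"
}

# Usage:
#   objdump -section-headers taichi_core.so | sum_hex_sizes
```

The result is the sum of the section sizes only, so it will not necessarily match the size of the file on disk.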


Lastly, I don't think #529 or any sort of platform-dependent code is the culprit for this kind of issue, unless we know it occupies a significant part of the codebase.

For Metal, the runtime part/APIs are already conditionally enabled only on the Mac platform. The things you are seeing are only the codegen part, but we have codegens for LLVM/Metal/OpenGL. Assuming each one is of the same complexity, and contributes roughly the same amount of data to the final .so, I don't see why disabling one particular codegen could significantly reduce the total size.

All 9 comments

Meanwhile, I have submitted a request to PyPI for a larger size. Not sure how soon they will respond though: https://github.com/pypa/pypi-support/issues/398

#529 can solve this systematically, but requires much more time than you want.

Temporary solutions, given the urgent label:

  1. Is it possible to make ld optimize the link, i.e. remove unneeded functions like metal_xxx?
  2. Try specifying -Os for clang to optimize for size, at the cost of some performance compared to -O3.

Btw, can you do some ELF dumping to figure out why the Linux package is so much larger than the Win/OSX ones?

Another thing: we could consider releasing the debug info as well, just not embedded in the shared lib. That way, users who really want to debug Taichi can at least do so without being completely blind. E.g. https://stackoverflow.com/questions/866721/how-to-generate-gcc-debug-symbol-outside-the-build-target
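For reference, splitting the debug info out usually comes down to three objcopy calls. This is a sketch of the general binutils technique, not something the Taichi build currently does; `split_debug` and the `RUN` dry-run variable are names made up for this example.

```shell
# split_debug FILE
# Move debug info from FILE into FILE.debug and leave a debuglink behind.
# Hypothetical helper; set RUN=echo to print the commands instead of running them.
split_debug() {
  lib=$1
  ${RUN:-} objcopy --only-keep-debug "$lib" "$lib.debug"      # copy debug sections out
  ${RUN:-} objcopy --strip-debug "$lib"                       # drop them from the library
  ${RUN:-} objcopy --add-gnu-debuglink="$lib.debug" "$lib"    # tell debuggers where they went
}

# Dry run:
#   RUN=echo split_debug taichi_core.so
# Real run (requires binutils, modifies the file in place):
#   split_debug taichi_core.so
```

The .debug file could then be shipped separately, and gdb should pick it up through the debuglink when it sits next to the library.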

Thank you for the valuable discussions! Following the suggestions by @k-ye, I did a benchmark (Default cmake build type: RelWithDebInfo):

  • Original size: 258 MB
  • Removing -g: 258 MB
  • Build with Release instead of RelWithDebInfo and remove -g: 53 MB
  • Build with Release instead of RelWithDebInfo and without removing -g: 258 MB

(-g means https://github.com/taichi-dev/taichi/blob/0c1bb390ef79008f6c7c44a039665622ef546bba/cmake/TaichiCXXFlags.cmake#L69)

So we have to do both: remove -g and use Release instead of RelWithDebInfo. As I guessed, RelWithDebInfo implicitly adds -g.
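One way to confirm where the implicit -g comes from is to look at the per-configuration flags CMake records in its cache. A minimal sketch, assuming a configured build directory with a CMakeCache.txt; `config_has_g` is a hypothetical helper name. With GCC or Clang the defaults are typically `-O2 -g -DNDEBUG` for RelWithDebInfo and `-O3 -DNDEBUG` for Release. This only inspects the cache, so flags appended by the project's own CMake files (like the -g linked above) will not show up here.

```shell
# config_has_g BUILD_TYPE [CACHE_FILE]
# Succeeds if the cached C++ flags for BUILD_TYPE contain a flag starting
# with -g (-g, -g3, -ggdb, ...). Hypothetical helper for illustration.
config_has_g() {
  cfg=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  grep "^CMAKE_CXX_FLAGS_${cfg}:" "${2:-CMakeCache.txt}" \
    | grep -Eq '(=| )-g'
}

# Usage from a build directory:
#   config_has_g RelWithDebInfo && echo "RelWithDebInfo adds -g"
#   config_has_g Release || echo "Release does not"
```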

> So we have to do both: remove -g and use Release instead of RelWithDebInfo. As I guessed, RelWithDebInfo implicitly adds -g.

BTW, maybe we should have two build modes: Release vs Debug (Dev)? Otherwise I guess the printed stack trace would be complete trash when we are developing. https://stackoverflow.com/a/7725055/12003165

> Another thing: we could consider releasing the debug info as well, just not embedded in the shared lib. That way, users who really want to debug Taichi can at least do so without being completely blind. E.g. https://stackoverflow.com/questions/866721/how-to-generate-gcc-debug-symbol-outside-the-build-target

The interesting thing is that even without debug info the stack backtrace system still works (I guess global function names are not really part of the debug info):

[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-ndcffy4r
[Taichi] sandbox prepared
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 660e9738, python 3.6.9
[E 05/19/20 09:55:29.695] [codegen_cuda.cpp:visit@398] test


***********************************
* Taichi Compiler Stack Traceback *
***********************************
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVMCUDA::visit(taichi::lang::OffloadedStmt*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVM::visit(taichi::lang::Block*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVM::emit_to_module()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenLLVM::gen()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::CodeGenCUDA::codegen()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::KernelCodeGen::compile()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Program::compile(taichi::lang::Kernel&)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::compile()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::operator()()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::ConstantFold::visit(taichi::lang::UnaryOpStmt*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::BasicStmtVisitor::visit(taichi::lang::Block*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::BasicStmtVisitor::visit(taichi::lang::Block*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::irpass::constant_fold(taichi::lang::IRNode*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::irpass::full_simplify(taichi::lang::IRNode*, taichi::lang::CompileConfig const&, taichi::lang::Kernel*)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::irpass::compile_to_offloads(taichi::lang::IRNode*, taichi::lang::CompileConfig const&, bool, bool, bool, bool, bool)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::lower(bool)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Program::compile(taichi::lang::Kernel&)
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::compile()
/tmp/taichi-ndcffy4r/taichi_core.so: taichi::lang::Kernel::operator()()
/tmp/taichi-ndcffy4r/taichi_core.so(+0x707c34) [0x7f668b887c34]
/tmp/taichi-ndcffy4r/taichi_core.so(+0x664137) [0x7f668b7e4137]
python3.6(_PyCFunction_FastCallDict+0x154) [0x560e3934bc54]
python3.6(_PyObject_FastCallDict+0x2bf) [0x560e3934c06f]
python3.6(_PyObject_Call_Prepend+0x63) [0x560e39350aa3]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(+0x16b371) [0x560e393a5371]
python3.6(_PyObject_FastCallDict+0x8b) [0x560e3934be3b]
python3.6(+0x199c0e) [0x560e393d3c0e]
python3.6(_PyEval_EvalFrameDefault+0x30a) [0x560e393f675a]
python3.6(PyEval_EvalCodeEx+0x966) [0x560e393ceff6]
python3.6(+0x1957d4) [0x560e393cf7d4]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(_PyEval_EvalFrameDefault+0x19e7) [0x560e393f7e37]
python3.6(+0x192e66) [0x560e393cce66]
python3.6(_PyFunction_FastCallDict+0x3d8) [0x560e393ce598]
python3.6(_PyObject_FastCallDict+0x26f) [0x560e3934c01f]
python3.6(_PyObject_Call_Prepend+0x63) [0x560e39350aa3]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(+0x16b371) [0x560e393a5371]
python3.6(PyObject_Call+0x3e) [0x560e3934ba5e]
python3.6(_PyEval_EvalFrameDefault+0x19e7) [0x560e393f7e37]
python3.6(+0x193136) [0x560e393cd136]
python3.6(+0x193ed6) [0x560e393cded6]
python3.6(+0x199b95) [0x560e393d3b95]
python3.6(_PyEval_EvalFrameDefault+0x30a) [0x560e393f675a]
python3.6(PyEval_EvalCodeEx+0x329) [0x560e393ce9b9]
python3.6(PyEval_EvalCode+0x1c) [0x560e393cf75c]
python3.6(+0x215744) [0x560e3944f744]
python3.6(PyRun_FileExFlags+0xa1) [0x560e3944fb41]
python3.6(PyRun_SimpleFileExFlags+0x1c3) [0x560e3944fd43]
python3.6(Py_Main+0x613) [0x560e39453833]
python3.6(main+0xee) [0x560e3931d88e]
/lib/x86_64-linux-gnu/libc.so.6: __libc_start_main
python3.6(+0x1c3160) [0x560e393fd160]

I guess this is enough for Linux users to report what's happening.

> > So we have to do both: remove -g and use Release instead of RelWithDebInfo. As I guessed, RelWithDebInfo implicitly adds -g.
>
> BTW, maybe we should have two build modes: Release vs Debug (Dev)? Otherwise I guess the printed stack trace would be complete trash when we are developing. https://stackoverflow.com/a/7725055/12003165

Right, I was worried about that. After the experiment above I think we can safely release without debug info, since we never use any information beyond the global function names. I'll go ahead and fix the Linux PyPI release. Thanks again for pointing out what's happening here!
