Hi,
I am trying to port the entire runtime to qnx7 platform on x64 arch. I am able to build coreclr but it won't run unless I have dotnet executable built. Any suggestions on how to build corehost for qnx?
Any suggestions on how to build corehost for qnx?
The same way as coreclr? It lives under https://github.com/dotnet/runtime/tree/master/src/installer/corehost
How about the .nuget packages downloaded for specific RID? I used this repo https://github.com/dotnet/core-setup/tree/v2.2.8, when I tried on linux, it pulls down some .nuget files for linux platform, but I don't have these files for QNX to pull down.
You may want to build it from dotnet/runtime repo. dotnet/runtime has everything together that avoids the issues with publishing and downloading packages between repos.
@jkotas Oh. Thanks! Shall I start with all subprojects or only coreclr and corehost should be enough for me?
You can start with src\coreclr, src\libraries\Native and corehost, and get the managed libraries from another Unix flavor.
Thanks! By saying managed libraries, do you mean the .dll libraries?
Right
@jkotas I tried the dotnet core 5.0.0-dev on linux and it can build a binary dotnet under artifacts directory, but when I tried to execute it, it gave me an error "A fatal error occurred. The folder [/home/
obj is directory for intermediate build files. It does not have the right directory layout.
Try the one under bin, e.g. artifacts/bin/testhost/netcoreapp5.0-linux-Debug-x64
@jkotas Thanks! I will try it out and let you know the progress.
@jkotas Can I publish my app to netcore sdk 5.0.0-dev? Or the other way around, can I build dotnet/runtime for sdk version 3? Following command will build a dotnet executable but it missed host folder and can't run from there. It doesn't build the artifacts/bin/testhost folder though.
/home/
What typically works best for initial bring-ups like this is to publish a standalone app (e.g. using dotnet publish -r linux-x64) and then overwrite the binaries with what you have built.
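A minimal sketch of that overlay step, assuming the published app and the locally built native libraries each sit in one flat directory. The function name overlay_natives and both directory arguments are hypothetical; only libraries that already exist in the publish output are replaced, so nothing unexpected is added to the app.

```bash
#!/usr/bin/env bash
# Copy locally built .so files over the ones shipped in a self-contained
# publish directory. Libraries that the publish output does not contain
# are reported and skipped rather than added.
overlay_natives() {
    local build_dir=$1 publish_dir=$2 lib name
    for lib in "$build_dir"/*.so; do
        [ -e "$lib" ] || continue          # glob matched nothing
        name=$(basename "$lib")
        if [ -e "$publish_dir/$name" ]; then
            cp "$lib" "$publish_dir/$name"
            echo "replaced $name"
        else
            echo "skipped $name (not in publish output)"
        fi
    done
}
```

Call it as overlay_natives <native-build-dir> <publish-dir>; the dotnet executable itself would still need to be copied separately.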
@jkotas Thanks! I tried replacing the dotnet executable with my own build of 5.0.0-dev and it seems to work. So I think my next step is to build QNX versions of the following shared libraries and replace them, am I correct? Do I really need libuv.so and libe_sqlite3.so? They are under AspNet, not NetCore.
./shared/Microsoft.AspNetCore.All/2.2.8/libuv.so
./shared/Microsoft.AspNetCore.All/2.2.8/libe_sqlite3.so
./shared/Microsoft.NETCore.App/2.2.8/libhostpolicy.so
./shared/Microsoft.NETCore.App/2.2.8/System.Native.so
./shared/Microsoft.NETCore.App/2.2.8/libmscordbi.so
./shared/Microsoft.NETCore.App/2.2.8/libmscordaccore.so
./shared/Microsoft.NETCore.App/2.2.8/libcoreclr.so
./shared/Microsoft.NETCore.App/2.2.8/System.IO.Compression.Native.so
./shared/Microsoft.NETCore.App/2.2.8/System.Security.Cryptography.Native.OpenSsl.so
./shared/Microsoft.NETCore.App/2.2.8/libsos.so
./shared/Microsoft.NETCore.App/2.2.8/libcoreclrtraceptprovider.so
./shared/Microsoft.NETCore.App/2.2.8/libsosplugin.so
./shared/Microsoft.NETCore.App/2.2.8/System.Globalization.Native.so
./shared/Microsoft.NETCore.App/2.2.8/libclrjit.so
./shared/Microsoft.NETCore.App/2.2.8/System.Net.Http.Native.so
./shared/Microsoft.NETCore.App/2.2.8/libdbgshim.so
./shared/Microsoft.NETCore.App/2.2.8/System.Net.Security.Native.so
./host/fxr/2.2.8/libhostfxr.so
Do I really need libuv.so and libe_sqlite3.so?
It depends on the ASP.NET Core you are planning to use, and how you plan to configure it.
libuv is not required for ASP.NET Core (it is an optional provider for KestrelHttpServer, primary backend is .NET's own managed sockets).
libe_sqlite3 (which comes from https://github.com/ericsink/SQLitePCL.raw) is required only when EntityFramework Core is used with SQLite provider.
@am11 @jkotas Thanks!
Any idea how this shared library is built? ./shared/Microsoft.NETCore.App/2.2.8/System.Net.Http.Native.so. I didn't find it after running src/libraries/Native/build-native.sh
This library no longer exists in dotnet/runtime repo.
@jkotas Thanks! I will work on the rest then.
For the managed libraries (.dll), can I reuse 2.2.8 sdk version? Only replacing .so and .a libraries with my own built version.
You are likely going to run into mismatches when combining 2.2.8 managed libraries with latest native binaries from dotnet/runtime
I am able to build corehost but got an ELF error while executing it on a QNX device. I am debugging why it happened.
@jkotas Is netcore 5 sdk available to try out?
Yes, you can download the daily builds at https://github.com/dotnet/core-sdk#installers-and-binaries
@jkotas Thanks!
I managed to build the dotnet executable using clang (built for QNX specifically), but when I used ldd to check dependencies, I got following error.
ldd: /tmp/dotnet: Exec format error
The readelf command showed following required libs and they are all present on the OS.
0x0000000000000001 (NEEDED) Shared library: [libm.so.3]
0x0000000000000001 (NEEDED) Shared library: [libiconv.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.4]
Any idea why this error occurred?
Exec format error
sounds like it got built for different architecture. if there is readelf(1) available, maybe try readelf -h $(command -v dotnet) | grep 'Class\|File\|Machine', e.g.:
$ readelf -h .dotnet/dotnet | grep 'Class\|File\|Machine'
Class: ELF64
Machine: Advanced Micro Devices X86-64
also does ldd -v /path/to/dotnet show something interesting?
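Where readelf is not available on the target, the same three facts can be read straight from the first 20 bytes of the file. The sketch below assumes od and tr exist on the system; elf_summary is a hypothetical helper name and only a few common machine codes are mapped.

```bash
#!/usr/bin/env bash
# Summarise the ELF identification bytes of a file using od(1) and tr(1).
elf_summary() {
    local file=$1 magic class data b0 b1 machine
    # Bytes 0-3 must be the magic number 7f 45 4c 46 ("\x7fELF").
    magic=$(od -An -tx1 -N 4 "$file" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not an ELF file"
        return 1
    fi
    # Byte 4 (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
    case $(od -An -tu1 -j 4 -N 1 "$file" | tr -d ' \n') in
        1) class=ELF32 ;;
        2) class=ELF64 ;;
        *) class=unknown-class ;;
    esac
    # Byte 5 (EI_DATA): 1 = little endian, 2 = big endian.
    case $(od -An -tu1 -j 5 -N 1 "$file" | tr -d ' \n') in
        1) data=little-endian ;;
        2) data=big-endian ;;
        *) data=unknown-endian ;;
    esac
    # e_machine is the 16-bit field at offset 18; combine its two bytes
    # according to the byte order found above.
    read -r b0 b1 <<< "$(od -An -tu1 -j 18 -N 2 "$file")"
    if [ "$data" = big-endian ]; then
        machine=$(( ${b0:-0} * 256 + ${b1:-0} ))
    else
        machine=$(( ${b0:-0} + ${b1:-0} * 256 ))
    fi
    case $machine in
        3)   machine=x86 ;;
        40)  machine=arm ;;
        62)  machine=x86-64 ;;
        183) machine=aarch64 ;;
        *)   machine="machine-$machine" ;;
    esac
    echo "$class $data $machine"
}
```

A matching class and machine only rules out an architecture mismatch; a binary linked against the wrong toolchain can still print the same summary.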
btw, is there anything like vagrant box or a regular vm available for qnx 7 for developers or is myqnx account mandatory for devs as well?
@am11 It showed following for both Linux and QNX version of dotnet
Class: ELF64
Machine: Advanced Micro Devices X86-64
I would assume this is fine? There is a QNX version of ldd, when I ran it on QNX, it gave me "exec format error". I can't run QNX version of ldd on Linux.
As for your question, it is required to have a myqnx account to download sdk and tools for QNX, it is one month free trial license.
@guesshe, thanks, i was hoping for something like openqnx, which seems to also exist, but not sure how similar it is with QNX 7. :)
Exec format error from compiled code on same system typically indicates that the compiler/linker has somehow picked up the incompatible toolchain. If you could share the build output with commands that were executed, that might help spotting such issue.
Also, here are the ELF headers on Ubuntu (which I think should differ from QNX):
$ dotnet --version
3.1.100
$ readelf -h $(command -v dotnet)
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x408a2b
Start of program headers: 64 (bytes into file)
Start of section headers: 103952 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 9
Size of section headers: 64 (bytes)
Number of section headers: 30
Section header string table index: 29
Two FYIs, you can also:
./build.sh -gcc -subsetCategory installer -configuration Release
VERBOSE=1 ./build.sh -subsetCategory installer -configuration Release
(assuming before -subsetCategory installer, -subsetCategory coreclr and -subsetCategory libraries were built using the same compiler; preferably start clean rm -rf artifacts or git clean -xdf)
@am11 Thanks! I am actually doing a cross-compiling. I will try your suggestions now.
@am11 /home/
Here is the link.txt under cmake generated build directory for dotnet executable.
/home/
@guesshe, the command looks correct for x86-64 target system. Is the target device (where you are getting Exec format error) using the same architecture or is it aarch64?
@am11 Thanks! I solved my issue by using qcc instead of clang. Maybe clang picked up something that messed up my cross-compiling environment. Now I am facing another issue where #define symlinkEntrypointExecutable "/proc/self/exefile" doesn't exist in QNX. I am working on finding an alternative solution.
symlinkEntrypointExecutable "/proc/self/exefile"
@guesshe, I had this problem when running dotnet in Linux emulator on FreeBSD. Although FreeBSD itself has a syscall for that https://github.com/dotnet/runtime/blob/b0351370ccd132d95c97b75312fc36adaacc2664/src/installer/corehost/cli/hostmisc/pal.unix.cpp#L698-L708 but emulator required mounting procfs to Linux chroot. You may want to try out the same on QNX https://www.qnx.com/developers/docs/6.5.0SP1.update/com.qnx.doc.neutrino_cookbook/s3_procfs.html to overcome this situation.
@am11 Thanks! I will add a QNX version of it.
I am able to build and run the dotnet executable. But when I pointed it to my published, self-contained hello_world application, it popped up following error. I suspect this has something to do with how I build coreclr (note my coreclr is still on version 2.2.8). Any idea how to debug this issue from coreclr perspective?
unknown symbol: _ZN3ETW5GCLog11FireGcStartEPNS0_14st_GCEventInfoE referenced from libcoreclr.so
unknown symbol: __tls_get_addr referenced from libcoreclr.so
@guesshe, perhaps it is due to FEATURE_EVENT_TRACE. We disabled this method when FEATURE_EVENT_TRACE is disabled for Android, just a few days ago: https://github.com/dotnet/runtime/blob/91f14182958b0fad9c9b4dc7d908ff955581979b/src/coreclr/src/vm/gctoclreventsink.cpp. Reason for disabling event tracing feature was that lttng-ust library is not available on Android (at least via Termux package manager, it is not). If QNX is also missing liblttng-ust, you can try building coreclr by disabling this feature:
# perform a full coreclr build (native+managed components)
./build.sh -subsetcategory coreclr -cmakeargs -DFEATURE_EVENT_TRACE=0
# or only native components
./src/coreclr/build-runtime.sh -cmakeargs -DFEATURE_EVENT_TRACE=0
then you will likely overcome the missing _ZN3ETW5GCLog11FireGcStartEPNS0_14st_GCEventInfoE issue. Also, please note that it is best to keep the versions of installer, libraries and coreclr subset categories in sync, i.e. build from same SHA-1 git hash. This will avoid running into API/ABI mismatches or missing symbols issues. I can imagine it is somewhat challenging to keep up with the running master, for that I suggest to distill to a good/known SHA-1 (e.g. from release/5.0-preview2 branch) and make that build (it's more work but worth it since you are in best position to pull it off). :)
@am11 Thanks! I modified clrfeatures.cmake to have set(FEATURE_EVENT_TRACE 0) if FEATURE_EVENT_TRACE is not defined. Is this the same as you pointed out to disable via cmakeargs?
Yes, it is the same thing (if it is compiling 🙂). For Android, it is disabled very early in the build: https://github.com/dotnet/runtime/blob/25fdaa850f492a9b4144670cac3522bd5b57cd6f/eng/common/cross/toolchain.cmake#L57 (when cmake sets up the toolchain for cross-compilation specified by CMAKE_TOOLCHAIN_FILE in gen-buildsys.sh; this toolchain.cmake script is invoked by cmake before project's first cmake script)
@am11 Is clang a must to build? Is it possible to use gcc? I had issue with using clang to build dotnet executable, which resulted in "exec format error".
Can someone please explain to me a bit more about how does dotnet executable load a .dll application? It would be helpful to my debugging. Thanks in advance!
-clang is the default when no compiler is specified; you can use -gcc as you have done before: https://github.com/dotnet/runtime/issues/33374#issuecomment-602789459
@am11 Thanks! It seems coreclr 2.2.8 doesn't support gcc. My build was on 2.2.8. I will bring my changes to net5 and try it from there.
For the record, you can read part of the FreeBSD saga at https://github.com/wfurt/corefx/wiki/Building-.NET-Core--2.x-on-FreeBSD and https://github.com/wfurt/corefx/wiki/Building-.NET-Core-3.x-on-FreeBSD (this is a clone, as the original wiki got lost with the runtime transition).
It outlines different strategies at different maturity stages. In general, getting the managed part built turned out to be the bigger challenge. The last effort is captured in https://github.com/dotnet/runtime/pull/34000, where we can cross-compile the native bits and use the rest of the build "normally". When I was looking for clang support I did notice QNX in the list as well.
One more note: getting changes into master is relatively OK. It is near impossible to get permission for maintenance branches, e.g. 2.x and 3.x. So even if you manage to get it working there, there is no avenue to take that work. Moving to master/5 is the right choice IMHO.
@wfurt Thanks for sharing this!
@wfurt @am11 I am almost there completing compiling net5 coreclr but I am facing following issue. My compiler is gcc 5.4.0.
/home/
/home/
/home/
/home/
/home/
/home/
/home/
@am11 OK. So I solved my issue by using clang3.9 as assembler. But now I am facing another issue when trying to run a helloworld.dll application.
ASSERT [EXCEPT ] at /home/rihe/Github/runtime/src/coreclr/src/pal/src/exception/signal.cpp.971: handle_signal: sigaction() call failed with error code 48 (Not supported)
I would assume this has something to do with QNX's implementation of sigaction() call?
gcc 5.4.0
Maybe try compiling with CXXFLAGS=-Wa,--divide
e.g. CXXFLAGS=-Wa,--divide ./src/coreclr/build-runtime.sh
At some point we achieved the support from gcc 4.9 to 9, however, CI is only testing gcc 7. Need some cycles to fix build on older GCC.
sigaction() call failed with error code 48 (Not supported)
Does it work with this patch:
diff --git a/src/coreclr/src/pal/src/exception/signal.cpp b/src/coreclr/src/pal/src/exception/signal.cpp
index d6d8256610e..5d80be4ffe6 100644
--- a/src/coreclr/src/pal/src/exception/signal.cpp
+++ b/src/coreclr/src/pal/src/exception/signal.cpp
@@ -960,9 +960,9 @@ Parameters :
--*/
void restore_signal(int signal_id, struct sigaction *previousAction)
{
- if (-1 == sigaction(signal_id, previousAction, NULL))
+ if (signal(signal_id, [](int signum) { (void)signum; /* ignored */ }) == SIG_ERR)
{
- ASSERT("restore_signal: sigaction() call failed with error code %d (%s)\n",
+ ASSERT("restore_signal: signal() call failed with error code %d (%s)\n",
errno, strerror(errno));
}
}
(we can make it nicer with cmake introspection etc. later)
@am11 Thanks for your reply! With this change applied, the program stuck at this method. Any idea if I can enable debug logging in libcoreclr.so?
@guesshe, for native runtime code, you can try using lldb by doing something like:
#!/usr/bin/env bash
$(command -v lldb) /path/to/yourapp
# inside lldb REPL, catch all C++ exceptions
(lldb) break set -E C++
(lldb) r
(lldb) bt
if you want to include stacktrace from managed side as well, then you would first need to build SOS plugin (libsosplugin.so) for LLDB from https://github.com/dotnet/diagnostics#building-the-repository, then:
#!/usr/bin/env bash
$(command -v lldb) /path/to/yourapp
# now inside the lldb REPL
break set -E C++
plugin load /path/to/libsosplugin.so
run
# will break on first C++ exception
dumpstack
for more info on diagnostics, there is much more content in the dotnet/diagnostics repo. Note, currently there is no gdb SOS plugin, only lldb is supported (https://github.com/dotnet/diagnostics/issues/272).
@am11 Thanks! It turned out SA_RESTART is not supported in QNX. I removed this flag and it moved a bit further. Now I am having following crash. I guess it has something to do with runtime host but I am not sure, any ideas?
Process 1585176 (dotnet) terminated SIGSEGV code=2 fltno=11 ip=0000000100fb14dd(/tmp/publish/publish/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f099) mapaddr=00000000002354dd. ref=0000000101b1fb40
Memory fault (core dumped)
So I solved my issue by using clang3.9 as assembler.
I think mixing clang and gcc toolchains is problematic. I'd try to fix the broken toolchain first and use either clang or gcc for the entire build.
@am11 oh. Thanks! even for assembler? I will go back and fix the assembler issue. Any idea about the runtime crash? I thought it has something to do with build in host list.
@guesshe, the reason why I mentioned using the same toolchain after looking at the segmentation fault is that we have previously been hit by SIGSEGV, and it is very hard to troubleshoot and understand the root cause in such a case. So it is best if the entire product is built with the same toolchain, to rule out such unrelated/external culprits.
Any idea about the runtime crash? I thought it has something to do with build in host list.
I did not get a chance to look deeper, but if you hit it after rebuilding the runtime with gcc, e.g.
# workaround for gcc5
CFLAGS=-Wa,--divide CXXFLAGS=-Wa,--divide ./build.sh -configuration debug
or build entire product with clang or gcc (v7 or above) if possible, then could you attach debugger and collect some data?
@am11 Any idea why the build generated empty files with names like this ???@??@8?@????@@@???????????????????
@am11 So I fixed the assembler issue by setting CMAKE_ASM_FLAGS to -Wa,--divide but I am still having this crash Process 1699864 (dotnet) terminated SIGSEGV code=2 fltno=11 ip=0000000100fb14dd(/tmp/publish/publish/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f099) mapaddr=00000000002354dd. ref=0000000101b28b40
I am trying to find qnx supported lldb.
@am11 Any idea why the build generated empty files with names like this ???@??@8?@????@@@???????????????????
Unicode? is LANG/LC_ALL supported on QNX? What is the file location?
fltno=11
@guesshe, I searched just this string (verbatim) on Google, and surprisingly found majority of QNX related hits on the first result page. This article describes how they solved such SIGSEGV with fltno=11 issue in a simple app on QNX using dladdr(3) . Perhaps you would need to adjust some linker flags to get the paging policy right or maybe some code changes, I am not sure. However, one thing I would try is comment out this line https://github.com/dotnet/runtime/blob/363b7add1906547eeba681b3f3ec3f686a603dee/eng/native/configureplatform.cmake#L343 and rebuild in order to verify whether or not it is due to -fPIC.
@am11 Thanks! I will try to figure it out.
@wfurt This shows up in my host linux. Not on target. I thought these could be build output files but they are all empty.
@guesshe, I remember when NetBSD folks ported coreclr:
the first thing that was done was to pass all platform abstraction layer (PAL) tests, which exercise the CRT functions used by the runtime: https://github.com/dotnet/runtime/blob/59be94b69845ecfbd5a694483c2a4853e99cc64b/docs/workflow/testing/coreclr/unix-test-instructions.md#pal-tests
and then run a simple hello world app using corerun (a basic host that complies with the runtime): https://github.com/dotnet/runtime/blob/7d67d17a9f49ad5f365467fcd3bf0d25f2b9349a/docs/workflow/building/coreclr/linux-instructions.md
iff we get this far, then run the coreclr tests, see src/coreclr/build-test.sh
That's the way I would recommend too (and we did it the same way when we were porting .NET Core to Linux 5 years ago)
@am11 @janvorli Thanks! I will follow this path.
@am11 continuing my debugging journey and getting the test suite running in QNX. I suspect my issue has something to do with how this function is called GetCLRRuntimeHost but I don't know how this related to cruntime implementation.
@am11 @janvorli So I managed to build and run the pal_test suite on my QNX VM. I got this result but I doubt it is valid, as I saw some processes crash during the test execution. Does that produce a PASS status? I had to modify the bash script to be able to execute in a ksh environment but that was not a big change. Next I will focus on fixing the crashes I saw during the test execution. Most of them happened at strlen and Unable to set thread priority to 0 (error 22)
Finished running PAL tests:
PAL Test Results:
Passed: 726
Failed: 0
@am11 @jkotas Do I need this managed library? System.Private.CoreLib.dll for coreclr to work?
You don't need it for PAL tests, but you need it for the next steps. This is the core managed library containing all the basic functionality and glue between the managed and native parts of the runtime.
As for some PAL tests failing and the results still showing that no tests have failed, this is strange and seems like we may have a bug not recognizing crashes as failures.
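One way a crash can be told apart from an ordinary failure is the child's exit status. The sketch below assumes bash's convention that a status above 128 means the process was killed by signal (status - 128); other shells, including some ksh variants, may encode this differently. classify_exit and run_one are hypothetical helpers, not part of the PAL test runner.

```bash
#!/usr/bin/env bash
# Classify an exit status as reported by bash:
# 0 = success, 1-128 = ordinary failure, above 128 = killed by a signal.
classify_exit() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo "pass"
    elif [ "$status" -gt 128 ]; then
        echo "crash (signal $(( status - 128 )))"
    else
        echo "fail (exit code $status)"
    fi
}

# Run one test executable, discard its output, and report how it ended.
run_one() {
    "$@" >/dev/null 2>&1
    classify_exit $?
}
```

Since each PAL test is a standalone executable, run_one can be pointed at a single test binary to see whether it passed, failed, or was terminated by a signal.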
@janvorli @am11 @jkotas Here are two types of crashes I saw during testing. One is related to the strlen function and the other to thread priority.
Process 114688025 (paltest_fprintf_test2) terminated SIGSEGV code=1 fltno=11 ip=0000000100078e10(/usr/lib/ldqnx-64.so.2@strlen+0x0000000000000000) mapaddr=0000000000078e10. ref=0000000000000000
Memory fault (core dumped)
.{1-807d485} ASSERT [THREAD ] at /home/
Process 166789145 (paltest_criticalsectionfunctions_test2) terminated SIGTRAP code=1 fltno=3 ip=00000000080b9f57(/mnt/river/tmp/pal_tests/src/pal/tests/palsuite/threading/CriticalSectionFunctions/test2/paltest_criticalsectionfunctions_test2@DebugBreak+0x000000000005c031) mapaddr=0000000000071f57.
Here is the crash when I tried to launch my helloworld.dll using corerun.
Process 172077080 (corerun) terminated SIGSEGV code=2 fltno=11 ip=00000001010834cb(/mnt/river/tmp/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f087) mapaddr=00000000002354cb. ref=0000000101bfab40
Memory fault (core dumped)
I didn't quite understand how the function GetCLRRuntimeHost works. I did have to comment out one source file named ./src/pal/src/thread/context.cpp due to register access differences between QNX and Linux; might this be the issue? I plan to revisit it later as it doesn't seem to be an easy fix.
Making the context.cpp stuff work is essential, primarily for hardware exception handling and for GC thread suspension.
As for the failing PAL tests, you can run the specific tests under a debugger and see why it fails or crashes. Each PAL test is a standalone executable that can be run.
@janvorli Thanks! Here is a question from our lead developer while I am working on getting the context.cpp file compiled. I have to change register access for the QNX target.
His question is "Is it possible to compile without hardware floating point support? Might help there if there is a compile option for software floating point instead of hw floating point -- there would be no need to save and restore FP registers"
Is it possible to compile without hardware floating point support?
Unfortunately not. The JIT uses xmm registers a lot.
@janvorli Thanks!
@am11 @janvorli I fixed the register access issue and enabled context.cpp in my build. However, I am facing a new linker issue. But I do have -fPIC in my compilation flag and in the project I have CMAKE_POSITION_INDEPENDENT_CODE set to TRUE. Any suggestions here?
/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): warning: relocation against `CONTEXT_CaptureContext' in readonly section `.text'.
/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): relocation R_X86_64_PC32 against symbol `CONTEXT_CaptureContext' can not be used when making a shared object; recompile with -fPIC
/home/rihe/qnx700/host/linux/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: final link failed: Bad value
cc: /home/rihe/qnx700/host/linux/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld error 1
@guesshe, does adding this line https://github.com/am11/runtime/blob/208143dbb181782119e74441a536c9a8efc29808/eng/native/configureplatform.cmake#L290 at the same place and recompiling (after rm -rf artifacts) help? This is currently what I am doing for Solaris bringup (still very much work in progress), and it fixed a similar relocation error for me.
Also if you could show the diff in context2.S, we will understand the error better. Maybe suffixing @gotpcrel will fix the issue.
@am11 I tried this solution and the result is the same. I didn't make any changes to context2.S file under amd64. What do you mean by suffixing @gotpcrel ?
I did have to comment out one source file named ./src/pal/src/thread/context.cpp due to register access difference between QNX and Linux
...
I fixed the register access issue and enabled context.cpp
...
/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): relocation R_X86_64_PC32 against symbol `CONTEXT_CaptureContext'
@guesshe, i mean git diff src/coreclr/src/pal/src/thread/context.cpp how you fixed context issue? Also are you building master or release/3x branch?
@am11 @janvorli It seems superpmi is not as critical as pal. Can I disable these sub-projects to test the functionality of pal?
superpmi-shim-collector
superpmi-shim-counter
superpmi-shim-simple
@am11 The way I did it was by adding QNX-specific register access macros, like the following; it is very similar to FreeBSD. I have to include QNX-specific header files but I am not sure if I can share the header file, as it is not under any open source license.
+#elif defined(__QNX__)
OK. Now I disabled superpmi sub-project and it builds. But I got a new crash.
Process 173518872 (corerun) terminated SIGSEGV code=2 fltno=11 ip=000000010108352d(/mnt/river/tmp/libcoreclr.so@registerTMCloneTable+0x00000000000118b2) mapaddr=000000000023552d. ref=0000000101bfcb40
Memory fault (core dumped)
Any suggestions on how to debug this?
https://github.com/dotnet/runtime/issues/33374#issuecomment-609059509 did you try something to fix fltno=11?
@am11 I didn't try anything specific. I bring back the context.cpp file and added qnx as targetOS. Now it complains about registerTMCloneTable, but I can't find this function in coreclr.
@guesshe do all PAL tests pass now? If they don't, there is not much sense in trying to run corerun. Btw, maybe you do that, but until you get everything running, I would recommend running it under gdb (or lldb if you have one on QNX). It is very unlikely to figure out problems just by executing the code and reasoning based on the crash code. You'll need to view the stack trace, local variables, etc. Maybe you do that already, but from your questions above, it seemed you are just trying to run it without debugger.
@janvorli Thanks! The only crashes I saw are strlen and thread priority. I think strlen is fine but thread priority might be an issue. I am setting up a debugger at the same time. I am trying to get the dump and reload it in the host gdb tool. I had a version conflict issue yesterday and will try to resolve it today. I got some help from our lead developer regarding this crash. He said the qcc compiler supports transactional memory in its runtime, but all the symbols are namespaced with the prefix _ITM_ (as in _ITM_registerTMCloneTable). How is this symbol defined in the binary?
We don't call such a function directly from our code and when I've googled for it, it seems it comes from usage of register_tm_clones function that we also don't use. So I guess it comes from the standard C library or something like that.
@janvorli Thanks! I got some feedback from our kernel developer. Hopefully it will help with understanding the issue.
QNX does not use transactional memory so it has nothing to do with libc.
There is a weak function _ITM_registerTMCloneTable() that gets called by
register_tm_clones() in libgcc (the compiler's supplied runtime
library). Because it's a weak symbol, it's OK to not resolve it and the
library will just skip the call to it.
Is it possible libcoreclr is being built with some option that turns
unresolved weak symbols into an error?
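To see which symbols a library leaves as unresolved weak references, the dynamic symbol table can be filtered. This sketch assumes GNU nm style output, where an undefined weak symbol appears with no address and the type letter "w" (or "v" for a weak object); weak_undefined is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Read `nm -D <library>` style output on stdin and print the names of
# undefined weak symbols: two-column lines whose type letter is w or v.
weak_undefined() {
    awk 'NF == 2 && ($1 == "w" || $1 == "v") { print $2 }'
}
```

Typical use would be nm -D /path/to/libcoreclr.so | weak_undefined, run with the nm that belongs to the toolchain that built the library.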
Is it possible libcoreclr is being built with some option that turns
unresolved weak symbols into an error?
No, there is nothing like that. However, looking again at the error
ip=000000010108352d(/mnt/river/tmp/libcoreclr.so@registerTMCloneTable+0x00000000000118b2, I've just realized it has probably nothing to do with that symbol. The offset (0x00000000000118b2) is too far away from that symbol to be in the same function. I think that what happens is that it fails at some place where there are no symbols available and it ends up reporting the closest symbol it finds, which by a mere chance ends up being the registerTMCloneTable.
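The "closest symbol" effect can be reproduced with a small lookup over a symbol listing. This sketch assumes input in the three-column `nm` format (hex address, type, name) and simply picks the highest symbol address at or below the target; nearest_symbol is a hypothetical helper, and the result says nothing about whether the address really lies inside that function.

```bash
#!/usr/bin/env bash
# Given an address as $1 and `nm` output on stdin, print the nearest
# preceding symbol and the distance to it, as crash reporters often do.
nearest_symbol() {
    local target=$(( $1 )) addr type name value best_name="" best_addr=-1
    while read -r addr type name; do
        # Undefined symbols have no address column, so "name" ends up empty.
        [ -n "$name" ] || continue
        value=$(( 16#$addr ))
        if [ "$value" -le "$target" ] && [ "$value" -gt "$best_addr" ]; then
            best_addr=$value
            best_name=$name
        fi
    done
    if [ -z "$best_name" ]; then
        echo "no symbol at or below the address"
        return 1
    fi
    printf '%s+0x%x\n' "$best_name" $(( target - best_addr ))
}
```

A large offset in the result, such as the 0x118b2 seen here, is a hint that the faulting code sits in a region without symbols rather than in the named function.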
Thanks! I will first fix the thread priority issue and then put this binary in gdb and debug. Do you think the strlen crash is also related? I am not sure if I can fix the strlen; it might be some limited support issue.
Regards
River He
I don't see why something as simple as strlen should be problematic, so it seems we end up getting wrong character pointer (maybe a NULL) somewhere and passing it to the strlen later. So the strlen failing is just an indicator of a problem somewhere else.
@janvorli I fixed the thread priority issue. Now, apart from the strlen issues I am having following exceptions. However, these exceptions are not considered failed tests, I still get 726 test cases passed and 0 failure.
...'paltest_namedmutex_test1' failed at line 397. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 463. Expression: m2 != nullptr
'paltest_namedmutex_test1' failed at line 556. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 670. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 287. Expression: parentEvents[i] != nullptr
'paltest_namedmutex_test1' failed at line 695. Expression: InitializeParent(testName, parentEvents, childEvents)
'paltest_namedmutex_test1' failed at line 930. Expression: AbandonTests_Parent()
'paltest_namedmutex_test1' failed at line 273. Expression: WaitForSingleObject(childRunningEvent, FailTimeoutMilliseconds) == WAIT_OBJECT_0
'paltest_namedmutex_test1' failed at line 320. Expression: AcquireChildRunningEvent(testName, childRunningEvent)
'paltest_namedmutex_test1' failed at line 759. Expression: InitializeChild(testName, childRunningEvent, parentEvents, childEvents)
@am11 @janvorli Is feature no stress_log supported? If I set -DFEATURE_NO_STRESSLOG, will this disable the feature?
@guesshe you can set that, but I am not sure why would you want to do that.
@janvorli @wfurt @jkotas With the help of our kernel developers, we managed to fix this crash and another stack issue. Now it proceeded to a point that looks very promising.
coreclr_initialize failed - status: 0x80004005
By reading porting notes from @wfurt, I downloaded netcore 5 sdk 5.0 using snap and published to netcoreapp5.0 targetframework. However, I still got the same issue.
The commit I checkout from master is 62112b0abb36654775552842231dc48a0d032655.
Any suggestions? Is this because I am on master not on the preview branch?
That maps to E_FAIL and there are many places where this can fail. You can try to set COREHOST_TRACE=1 and check if that provides any hints. (I assume you disabled r2r, right?)
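For reference, the status value can be split into its standard HRESULT fields. The sketch below assumes the usual layout (bit 31 severity, bits 16-26 facility, bits 0-15 code); decode_hresult is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Split an HRESULT into severity, facility and code.
decode_hresult() {
    local hr=$(( $1 ))
    printf 'severity=%d facility=%d code=0x%04x\n' \
        $(( (hr >> 31) & 1 )) $(( (hr >> 16) & 0x7ff )) $(( hr & 0xffff ))
}

decode_hresult 0x80004005   # prints: severity=1 facility=0 code=0x4005
```

Facility 0 with code 0x4005 is the generic E_FAIL, so the value itself does not identify the failing step; the host trace is what narrows it down.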
I don't think the branch matters.
@wfurt Thanks! What is r2rm? Does this failure mean the cruntime is passed?
There was a typo. R2R -> Ready To Run. With crossgen, we may put in native bits to make startup faster. Because of that, you may not be able to simply copy assemblies targeted for another platform. It should work for the hello world app, but I'm wondering how you got the BCL assemblies.
Back then, I used COMPlus_ZapDisable=1 and COMPlus_ReadyToRun=0 when trying to use Linux assemblies on FreeBSD. @janvorli or @jkotas may know better if that is still applicable.
@wfurt Is that an environment variable? I don't recall setting that. For the BCL assemblies, I plan to upload the built tools and source code to the target and build from there directly instead of cross-compiling.
Yes, they are environment variables. I'm not quite sure what you mean by the previous post. In order to build assemblies you need a working dotnet CLI, and the C# compiler is written (mostly) in C#.
corerun cannot function without System.Private.CoreLib.dll (and perhaps others), so the question is how did you get one?
I am not sure if I understand it correctly. I did build the src/installer/corehost/ project, which contains the dotnet executable binary, and I can run it to load hello_world.dll (which failed at the same point as using corerun). Do you mind if we have a quick chat offline on this topic? Via Zoom or something like that?
sure, ping me with details: tweinfurt at yahoo.
I don't think your test is valid. You can try the steps on Linux (or other supported platform)
@wfurt @janvorli We are trying to debug this 0x80004005 error, and the trace output follows. It looks like it failed to load System.Private.CoreLib.dll. The trace is trimmed and formatted to make it easier to read. Is System.Private.CoreLib.dll mandatory in order to run an empty main function? My hello_world app has only one line: "static void Main(string[] args) {}".
Starting
corhost.cpp - CorRuntimeHostBase::Start
ceemain.cpp - EnsureEEStarted - g_fEEShutDown==0
ceemain.cpp - EEStartup - InitializeClrNotifications - status==0000000000
ceemain.cpp - EEStartup - InitializeJITNotificationTable - status==0000000000
ceemain.cpp - EEStartup - Initialize - status==0000000000
ceemain.cpp - EEStartupHelper - start
ceemain.cpp - EEConfig::Setup - start
ceemain.cpp - EEConfig::Setup - done
ceemain.cpp - InitializeStratupFlags - done
ceemain.cpp - PAL_SetShutdownCallback - done
ceemain.cpp - InitializeLogging - done
ceemain.cpp - EnsureRtlFunctions - done
ceemain.cpp - g_pConfig->sync - done
ceemain.cpp - InitializeSpinConstants - done
ceemain.cpp - InitializeStubManagers - done
ceemain.cpp - Stubs - done
ceemain.cpp - Inits - done
rcthread.cpp - DebuggerRCTthread started
m_thread!=NULL, hr==0000000000
rcthread.cpp - Thread created: hr==0000000000
rcthread.cpp - Done: hr==0000000000
ceemain.cpp - InitializeDebugger - done
ceemain.cpp - Profiling service - hr==0000000000 - done
ceemain.cpp - InitPreStubManager - done
ceemain.cpp - g_pGCHeap->Initialize - hr==0000000000 - done
ceemain.cpp - SystemDomain debugging - done
ceemain.cpp - MethodDesc::Init - start
ceemain.cpp - MethodDesc::Init - done
ceemain.cpp - SD Init - start
appdomain.cpp - Init - start
appdomain.cpp - LOG - done
appdomain.cpp - ZapDisable - done
appdomain.cpp - GetInternalSystemDirectory - hr==0x8007007a - done
appdomain.cpp - GetInternalSystemDirectory(buffer) - hr==0x8007007a - done
appdomain.cpp - LoadBaseSystemClasses - start
appdomain.cpp - LoadBaseSystemClasses - start
appdomain.cpp - ETWOnStartup - done
appdomain.cpp - OpenSystem - start
pefile.cpp - OpenSystem - start
pefile.cpp - DoOpenSystem - start
pefile.cpp - ETWOnStartup - start
pefile.cpp - ETWOnStartup - done
pefile.cpp - BindToSystem - start
appdomain.hpp - SystemDirectory is /
coreclrbindercommon.cpp - AssemblyBinder::BindToSystem - start
assemblybinder.cpp - GetAssembly - sCoreLib==/home/qnxuser/ - start
assemblybinder.cpp - AssemblyBinder::GetAssembly - start
Assembly path is /
coreassemblyspec.cpp - BinderAcquirePEimage - start
coreassemblyspec.cpp - OpenImage - start
coreassemblyspec.cpp - TryOpenFile - start
peimage.cpp - TryOpenFile - m_path==/home/qnxuser/System.Private.CoreLib.dll
coreassemblyspec.cpp - TryOpenFile - done - hr==0x80070002
AssemblyBinder::BindToSystem - done - hr==0x80070002
ceemain.cpp - CATCH - done
ceemain.cpp - if !FAILED - hr==0000000000 - done
ceemain.cpp - EEStartup - EEStartupHelper - status==0x80004005
ceemain.cpp:327 - g_EEStartupStatus==0x80004005
corhost.cpp - Done - hr==0x80004005
Start: 0x80004005
coreclr_initialize failed - status: 0x80004005
The error 0x80070002 means "File not found". Is it possible that there is some access problem to the /home/qnxuser/System.Private.CoreLib.dll?
Btw, error codes starting with 0x8007 represent Windows error codes. The lowest 16 bits of the code contain a Windows error code. These Windows error codes are described here: https://docs.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
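To make the decoding concrete, here is a minimal Python sketch that splits an HRESULT into its failure bit, facility, and low 16-bit code, applied to the values from the trace above. The helper names (`decode_hresult`, `describe`) and the small lookup table are illustrative assumptions, not part of the runtime; the table covers only the two Win32 codes that appear in this thread.

```python
# Win32 error codes that appear in the trace above (deliberately not exhaustive).
WIN32_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    122: "ERROR_INSUFFICIENT_BUFFER",
}

# Facility value used for HRESULTs that wrap a Win32 error code (the 0x8007xxxx range).
FACILITY_WIN32 = 7


def decode_hresult(hr):
    """Split a 32-bit HRESULT into (failed, facility, code)."""
    hr &= 0xFFFFFFFF
    failed = bool(hr >> 31)          # top bit: severity (1 = failure)
    facility = (hr >> 16) & 0x7FF    # 11-bit facility field
    code = hr & 0xFFFF               # lowest 16 bits: the error code
    return failed, facility, code


def describe(hr):
    """Return a short human-readable description of an HRESULT."""
    failed, facility, code = decode_hresult(hr)
    if facility == FACILITY_WIN32:
        name = WIN32_ERRORS.get(code, "unknown Win32 error")
        return f"0x{hr:08x}: Win32 error {code} ({name})"
    return f"0x{hr:08x}: facility {facility}, code 0x{code:04x}"


if __name__ == "__main__":
    for value in (0x80070002, 0x8007007A, 0x80004005):
        print(describe(value))
```

With this, 0x80070002 decodes to Win32 error 2 (file not found), while 0x80004005 (E_FAIL) is not in the Win32 facility and carries no further detail.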
@janvorli I don't have this managed library built. I only have libcoreclr.so. Based on previous posts in this thread, I had a feeling I don't need managed libraries to test basic PAL functionalities. Following is quoted from previous posts.
"the first thing that was done was to pass all platform abstraction layer (PAL) tests, which exercise the CRT functions used by the runtime: https://github.com/dotnet/runtime/blob/59be94b69845ecfbd5a694483c2a4853e99cc64b/docs/workflow/testing/coreclr/unix-test-instructions.md#pal-tests
and then run a simple hello world app using corerun (a basic host that complies with the runtime): https://github.com/dotnet/runtime/blob/7d67d17a9f49ad5f365467fcd3bf0d25f2b9349a/docs/workflow/building/coreclr/linux-instructions.md
if we get this far, then run the coreclr tests, see src/coreclr/build-test.sh"
I tried a Linux version of corerun and libcoreclr.so, it doesn't give me an error looking for System.Private.CoreLib.dll. Did I misunderstand something in the instructions above?
The part that tests the PAL is the PAL test suite that you've run before. That's the only part of the testing that doesn't run managed code.
The corerun is a tool to run managed applications. So it requires System.Private.CoreLib.dll and other managed assemblies (depending on what your hello world managed app needs).
I assume the Linux version didn't fail because the System.Private.CoreLib.dll is present.
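A quick way to rule out the "missing or inaccessible file" case before starting corerun is a small pre-flight check. The Python sketch below assumes, as the trace suggests, that the runtime looks for System.Private.CoreLib.dll in a single directory (the one reported as m_path in the trace); the function name `check_corelib` is hypothetical.

```python
import os

CORELIB = "System.Private.CoreLib.dll"


def check_corelib(runtime_dir):
    """Report whether CoreLib is present and readable in runtime_dir.

    Returns "missing", "unreadable", or "ok".
    """
    path = os.path.join(runtime_dir, CORELIB)
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "unreadable"
    return "ok"


if __name__ == "__main__":
    # Directory taken from the trace in this thread; adjust for your target.
    print(check_corelib("/home/qnxuser"))
```

A result of "missing" corresponds to the 0x80070002 seen in the trace; "unreadable" would point at a permissions problem instead.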
@janvorli I don't recall putting System.Private.CoreLib.dll in the same directory as libcoreclr.so; maybe it also searches other locations? May I use a Linux version of System.Private.CoreLib.dll to see if it works? If not, how can I build a QNX version of System.Private.CoreLib.dll?
Yes, you can use the Linux version, it should just work (provided it is built from exactly the same state of the source tree as the libcoreclr.so that you've built for QNX and it is the same build flavor - you cannot combine Release build of libcoreclr.so with Debug or Checked build of System.Private.CoreLib.dll and vice versa).
@janvorli Thanks! I will give it a try. By the same state, do you mean it should come from the same commit, or just a similar one? What errors could it give if they are from different commits? I would prefer to actually build it for QNX, but that doesn't seem to support cross-compiling. I might have to upload the source code to QNX directly and run the build from there.
I mean the same commit. There are shared data structures between libcoreclr.so and System.Private.CoreLib.dll, so any change in the layout of those structures would break things. Using commits close to each other might work, but it is not worth the effort of investigating the problems that may result.
also debug/release needs to match, right? (At least in the past, a release System.Private.CoreLib.dll did not work with a debug coreclr.)
@janvorli @wfurt Thanks! I will try it out and let you know the result.
also debug/release needs to match, right?
Yes, I've mentioned that in a comment above.
@janvorli It seems we still have an issue with the Linux version of System.Private.CoreLib.dll. Any idea what this error means? The new error is that the PE image file is not in native machine format.
Can you please set the following env variables and try again? This should let the runtime load only the IL code from the System.Private.CoreLib.dll and not the already precompiled binary code that is likely causing the trouble.
COMPlus_ZapDisable=1
COMPlus_ReadyToRun=0
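For repeated test runs, the two variables can be set for the child process only, so the rest of the shell session is unaffected. This is a minimal Python sketch; the function names and the corerun and hello_world.dll paths passed to `run_hello` are hypothetical and need to match your own layout.

```python
import os
import subprocess


def r2r_disabled_env(base=None):
    """Return a copy of the environment with precompiled code loading disabled."""
    env = dict(os.environ if base is None else base)
    env["COMPlus_ZapDisable"] = "1"   # do not use precompiled native images
    env["COMPlus_ReadyToRun"] = "0"   # ignore Ready To Run code, use IL only
    return env


def run_hello(corerun, app):
    """Run a managed app under corerun with the variables above set."""
    result = subprocess.run([corerun, app], env=r2r_disabled_env())
    return result.returncode
```

For example, `run_hello("./corerun", "hello_world.dll")` launches the app with both variables applied, assuming both files are in the current directory.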
@quesshe, it was discovered that the COMPlus_ZapDisable handling was accidentally disabled for some time and fixed four days ago in #35741. I'm not sure what state of the repository you are using, but you'll likely need that fix to be able to load the System.Private.CoreLib.dll built on Linux. You can easily port that change to any state of the repository as it just removes an #ifdef around getting the option related to that env variable.
Thanks! I will do that and try it out.
/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): relocation R_X86_64_PC32 against symbol `CONTEXT_CaptureContext' can not be used when making a shared object; recompile with -fPIC
I was also getting this error when compiling coreclr's superpmi project with illumos sysroot on Ubuntu 18.04. I was using gcc v8.4.0 and binutils v2.25.1, both built for illumos target. The fix was to upgrade binutils to v2.33.1, without code modifications in coreclr. It was due to an upstream bug in binutils's assembler (as) or archiver (ar), which was fixed around v2.29-v2.30.
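If you want to check which binutils your cross toolchain ships before rebuilding, a version comparison along these lines may help. It is only a sketch: it assumes the first line of the linker's `--version` output ends with the version number, and the 2.30 threshold is taken from the rough "fixed around v2.29-v2.30" estimate above, not from a confirmed changelog entry. The function names are hypothetical.

```python
import re


def parse_binutils_version(version_line):
    """Extract (major, minor, patch) from a line ending in a version number."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?\s*$", version_line)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def is_new_enough(version_line, minimum=(2, 30, 0)):
    """True if the parsed version is at least `minimum` (an assumed threshold)."""
    version = parse_binutils_version(version_line)
    return version is not None and version >= minimum


if __name__ == "__main__":
    print(is_new_enough("GNU ld (GNU Binutils) 2.25.1"))
    print(is_new_enough("GNU ld (GNU Binutils) 2.33.1"))
```

The input line would typically come from running the cross linker named in the error message with `--version` and taking the first line of its output.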
@guesshe Can you please tell me if you got the corehost to work?
Most helpful comment
You don't need it for PAL tests, but you need it for the next steps. This is the core managed library containing all the basic functionality and glue between the managed and native parts of the runtime.
As for some PAL tests failing and the results still showing that no tests have failed, this is strange and seems like we may have a bug not recognizing crashes as failures.