I've built Ray on an Odroid-XU4 board (http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825). When I try to run a simple application on it, Ray reports the following issue.
The attached Ray_Issue_XU4.log contains the Ray log.
Ray_Issue_XU4.log
I'm a little surprised to see this error
/ray/src/thirdparty/arrow/cpp/src/plasma/io.cc98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4
since that value has never changed from 0, see https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.h#L34
It's possible that we're doing the arithmetic incorrectly somewhere in this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L94-L109 and this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L63-L69.
E.g., maybe one of the types has the wrong size or something or there is a mismatch between the two blocks.
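To make the size-mismatch idea concrete, here is a minimal sketch in Python (not the actual Plasma code). It assumes a header of three little-endian 64-bit fields (version, type, length) followed by the payload, and a hypothetical reader that consumes only 4 bytes for the length, as a 32-bit size_t would. The field layout and the names write_message and read_message are assumptions made for illustration.

```python
import struct

PROTOCOL_VERSION = 0  # matches the constant discussed above


def write_message(msg_type, payload):
    # Assumed wire format: version, type, length as 64-bit ints, then payload.
    return struct.pack("<qqq", PROTOCOL_VERSION, msg_type, len(payload)) + payload


def read_message(buf, offset, length_fmt):
    # length_fmt is "<q" for a matching 8-byte length, or "<I" for a
    # hypothetical reader that only consumes 4 bytes (32-bit size_t).
    version, msg_type = struct.unpack_from("<qq", buf, offset)
    offset += 16
    (length,) = struct.unpack_from(length_fmt, buf, offset)
    offset += struct.calcsize(length_fmt)
    payload = buf[offset:offset + length]
    return version, msg_type, payload, offset + length


if __name__ == "__main__":
    stream = write_message(1, b"abcd") + write_message(2, b"efgh")
    for fmt in ("<q", "<I"):
        offset = 0
        for _ in range(2):
            version, msg_type, payload, offset = read_message(stream, offset, fmt)
            print(fmt, "version =", version, "type =", msg_type, "payload =", payload)
```

With the mismatched length width, the first message still appears to have the right version, but the reader falls out of step with the stream and the second "version" is read from payload bytes. That is the kind of symptom a nonzero version in the check failure could point to, though this sketch does not prove that is what happened here.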
Another thing to verify is that you can start the plasma store by hand without any trouble. In your case probably
/usr/local/lib/python2.7/dist-packages/ray-0.2.0-py2.7-linux-armv7l.egg/ray/plasma/../core/src/plasma/plasma_store -s /tmp/s1 -m 1000000
If that works, then try connecting a plasma manager. E.g., check out the instructions in this comment https://github.com/ray-project/ray/issues/108#issuecomment-308572084.
I have the same issue on a different platform (Ubuntu 16.04 VM running on Windows 7). I followed the instructions for connecting a plasma manager, and was able to start a plasma store, but when I tried to start a plasma manager, I received a /ray/src/thirdparty/arrow/cpp/src/plasma/io.cc98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4 error thrown from the plasma store, and /ray/src/plasma/plasma_manager.cc483 Check failed: _s.ok() Bad status: IOError: Broken pipe thrown from the plasma manager. Any advice on how to proceed?
@arvindc95 @akzare could you try cherry-picking this commit apache/arrow#1172, recompiling Arrow, and see if it fixes the problem? I just looked through the code in that file and spotted that potential bug.
Let me know if you have questions about how to do this.
If that doesn't work, then I think we'll just need to add a lot of print statements (e.g., in this function https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L90) and print the actual bytes that are being sent and see if we can infer anything from that.
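If you manage to capture the raw bytes some other way (e.g., by reading from the socket in a small test script), a helper like the one below can make them easier to compare by eye. This is only a sketch in Python; the print statements suggested above would go in the C++ code itself. The name hexdump is made up, and the example header assumes three little-endian 64-bit fields, which is my reading of the code and not something confirmed in this thread.

```python
import struct


def hexdump(data, width=16):
    """Format bytes as offset-prefixed rows of two-digit hex values."""
    lines = []
    for start in range(0, len(data), width):
        chunk = bytearray(data[start:start + width])  # works on Python 2 and 3
        hex_part = " ".join("{:02x}".format(b) for b in chunk)
        lines.append("{:04x}  {}".format(start, hex_part))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical header: version 0, type 1, length 4 (assumed layout).
    header = struct.pack("<qqq", 0, 1, 4)
    print(hexdump(header))
```

Dumping what the sender writes and what the receiver reads side by side should show quickly whether the two ends disagree about field widths.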
@robertnishihara thanks for the help, that commit helped me get ray initialized; I'm able to put and get objects from the plasma store, and use the remote function when there's nothing to be parallelized, but when I try running the time.sleep example in the documentation (http://ray.readthedocs.io/en/latest/tutorial.html#remote-functions), I get a segmentation fault thrown from the local scheduler. Do you have any ideas how I can debug this? Are there log files generated by the scheduler?
Glad to hear it, and thanks for trying it out! Sounds like there's a bug in the local scheduler (perhaps similar to the previous bug).
You're rebuilding all of Ray, right? Because the local scheduler also communicates with the plasma store, so it probably needs the same fix from apache/arrow#1172.
Some processes log to /tmp/raylogs, so it's worth looking at the most recent files in there to see if anything turns up, but if you're starting Ray with ray.init(), then the local scheduler STDERR/STDOUT will just go to the terminal.
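For finding the most recent files in /tmp/raylogs, something like this small Python sketch should do. The function name most_recent_files is made up; it just sorts by modification time and makes no assumption about how the log files are named.

```python
import os


def most_recent_files(directory, count=5):
    """Return up to `count` regular files in `directory`, newest first."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    files.sort(key=os.path.getmtime, reverse=True)
    return files[:count]


if __name__ == "__main__":
    log_dir = "/tmp/raylogs"
    if os.path.isdir(log_dir):
        for path in most_recent_files(log_dir):
            print(path)
```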
What I would suggest is trying to run the same workload that is causing the crash, but to start the local scheduler in gdb. To do that, you could do something like the following.
First modify
to be something like
import IPython
IPython.embed()
# pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
pid = 9999
Then start Python and do import ray and ray.init(). This will open up IPython when it tries to start the local scheduler. Run print(command) in the IPython shell to print the command that Ray wants to use to start the local scheduler.
Then go to a different terminal window, and do
gdb ray/python/ray/core/src/local_scheduler/local_scheduler
Then do run followed by the command printed by print(command) to start the local scheduler in gdb. However, you'll need to drop the initial executable from the command, AND you'll need to add quotes around the full argument to the -w flag, which is pretty long. Otherwise you'll get an error saying unknown flag or something like that.
Then go back to the IPython shell and do exit()
Then run your workload and see what errors are caught in gdb.
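To avoid quoting mistakes in the gdb step, you could build the run line from inside the IPython shell instead of editing it by hand. This is a sketch that assumes command is a list of strings (as subprocess.Popen accepts); the helper name gdb_run_line and the example arguments are made up.

```python
try:
    from shlex import quote  # Python 3
except ImportError:
    from pipes import quote  # Python 2


def gdb_run_line(command):
    """Drop the executable and shell-quote every remaining argument."""
    return "run " + " ".join(quote(arg) for arg in command[1:])


if __name__ == "__main__":
    # Made-up example; the real list comes from print(command) in IPython.
    example = ["/path/to/local_scheduler", "-s", "/tmp/sched",
               "-w", "python worker.py --flag=1"]
    print(gdb_run_line(example))
```

Arguments containing spaces, such as the long -w value, come out wrapped in single quotes, and the result can be pasted at the (gdb) prompt.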
Note that if the error is uninformative, we may need to recompile Ray with more debug information. E.g., maybe add a -g to the line
@robertnishihara I tried the steps you outlined for using gdb, but when I tried to run my workload I kept getting an exception when defining a function with the @ray.remote decorator; I've attached the error thrown:
decorator_error.txt
Also, when making the fix you referenced, I made the code change in the arrow code and then reran python setup.py install in order to rebuild Ray. Let me know if this procedure is incomplete for rebuilding Ray (I also ran this after changing local_scheduler_services.py because the IPython shell wasn't showing up)
Thanks again for your help!
@arvindc95 interesting, that seems like the same error as #394.
You could try using IPython instead of Python, since #394 was only an issue in the regular Python interpreter.
It's also possible that when you reran python setup.py install, it undid your changes to Arrow. Can you check that your changes were unaffected? Or perhaps comment out this line
Also, instead of using python setup.py install, I'd suggest using python setup.py develop. That way, whenever you change the Python code, you won't need to rerun setup.py; the changes will be picked up automatically.
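One quick way to see which copy of Ray Python is actually importing is to print the package directory. This is a generic sketch (the helper name package_location is made up) and it works for any importable package.

```python
import importlib
import os


def package_location(name):
    """Return the directory that the named package is imported from."""
    module = importlib.import_module(name)
    return os.path.dirname(os.path.abspath(module.__file__))


if __name__ == "__main__":
    # Demonstrated on a standard-library package; use "ray" in practice.
    print(package_location("json"))
```

With python setup.py develop, package_location("ray") should point into your source checkout; with python setup.py install, it would typically point into an egg under dist-packages, like the path in the plasma_store command earlier in this thread.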
@robertnishihara Using IPython helped; my workload runs successfully, but the debugger throws the following error immediately after the workload completes:
Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.
Would this segfault be from local_scheduler_services.py or any of the functions it calls?
If you do bt in gdb, does that print anything?
This error looks similar https://groups.google.com/forum/#!topic/jansson-users/u78eGC15itw.
cc @atumanov
Yes, here's the output:
Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.
(gdb) bt
format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe928 "") at hiredis.c:262
format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe920 "\340\316\f\b\036")
at async.c:654
format=0x809fd45 "ZADD %b %s %b") at async.c:669
key=0x80ccee0 "event_log:\213\313\363\265O\312\300苑#\206蓪\234\274\301\332T\222\004", key_length=30,
value=0x80cc8e8 "[[1507822835.414047, \"ray:get_task\", 1, {}], [1507822914.265168, \"ray:import_function_to_run\", 1, {}], [1507822914.265763, \"ray:import_function_to_run\", 2, {}], [1507822914.266116, \"ray:import_functio"...,
value_length=1520, timestamp=1507822920.4063809)
at /home/achand/ray/src/common/logging.cc:100
at /home/achand/ray/src/common/thirdparty/ae/ae.c:412
at /home/achand/ray/src/common/thirdparty/ae/ae.c:455
at /home/achand/ray/src/common/event_loop.cc:58
It looks like it is using SSE2 instructions which probably aren't available on ARM. Could it be that there is some issue with the (cross-)compilation?
@pcmoritz I checked the instruction sets supported in the VM guest and SSE2 is one of them (it's supported in the VM host as well)
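Since the glibc path in the trace (sysdeps/i386/i686) suggests a 32-bit x86 build, it may also be worth confirming the word size of the environment. The sketch below reports what the running Python interpreter sees; it assumes the interpreter was built for the same architecture as the Ray binaries, which may not hold on every setup. The name platform_summary is made up.

```python
import ctypes
import platform
import struct


def platform_summary():
    """Collect the machine name and a few C type widths for this process."""
    return {
        "machine": platform.machine(),
        "pointer_bits": struct.calcsize("P") * 8,
        "size_t_bytes": ctypes.sizeof(ctypes.c_size_t),
        "int64_bytes": ctypes.sizeof(ctypes.c_int64),
    }


if __name__ == "__main__":
    for key, value in sorted(platform_summary().items()):
        print(key, "=", value)
```

If size_t_bytes comes out as 4 while int64_bytes is 8, any code that treats the two types interchangeably is a candidate for the kind of bug discussed in this thread.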
@arvindc95 I created a PR here: https://github.com/ray-project/ray/pull/1122 Could you try both the commits in the PR and see if one of them makes it work? These are both fixing potential problems here. Thanks!
In particular, we'd be interested in knowing which of the two commits fixes it (assuming one of them does in fact fix it).
seg_fault_fix_logging.txt
seg_fault_add_casts.txt
Both failed the same way as before; the segfault happened after the results of foo.remote() were returned. I made the logging code change, ran python setup.py develop, then tried running a workload, and repeated the process for the static cast addition as well. The gdb logs show the updated logging code change, and the lines referenced are slightly different between the two logs, so I think the changes were compiled; let me know if I missed anything.
Also, I've been manually starting ray because when I don't, the plasma store never initializes. I changed the socket name from /tmp/s1 to /tmp/s2 in case the same socket was being reused every time I manually started the store, but the store was still being initialized, so I'm not sure why it doesn't get made when I don't manually start the store.
Hm, thanks for trying it out. Is there any chance you could share your VirtualBox image (or an EC2 AMI if you have one) with us, together with instructions to reproduce the problem, so we can dig deeper into this?
I was able to reproduce the problem on 32-bit Ubuntu 16.04 and fix it. I put together a quick PR that fixes it for me. Could you please try out https://github.com/ray-project/ray/pull/1126? Thanks.
@atumanov The changes from your PR worked, thanks so much!
Also, thanks to @pcmoritz and @robertnishihara for your help resolving this as well! Would you still like me to post my VirtualBox image?
@arvindc95, awesome, glad to hear! The VirtualBox image will be helpful for testing, in case we need to reproduce any other problems you encounter. If you are in a position to provide us with the ODROID platform for testing purposes as well, even better :)
Closing for now since a lot of things have changed.