Ray: Ray issue on Odroid-XU4 board

Created on 24 Sep 2017 · 22 Comments · Source: ray-project/ray

I've built Ray on an Odroid-XU4 board (http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825). When I try to run a simple application on it, Ray reports the following issue:

The attached Ray_Issue_XU4.log contains the Ray log.
Ray_Issue_XU4.log

akzare

All 22 comments

I'm a little surprised to see this error

/ray/src/thirdparty/arrow/cpp/src/plasma/io.cc98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4

since that value has never changed from 0, see https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.h#L34

It's possible that we're doing the arithmetic incorrectly somewhere in this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L94-L109 and this block https://github.com/apache/arrow/blob/b41a4ee2322d0084ff78b78ccfebc4536f7e0a62/cpp/src/plasma/io.cc#L63-L69.

E.g., maybe one of the types has the wrong size or something or there is a mismatch between the two blocks.
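
One quick way to test the "wrong size" hypothesis on the board itself is to print the sizes of a few C integer types. The sketch below uses only Python's standard ctypes module. The choice of types to inspect is an assumption about which ones matter here; the thread does not say which type, if any, is at fault.

```python
import ctypes


def c_type_sizes():
    """Return the size in bytes of a few C integer types on this machine."""
    types = {
        "int": ctypes.c_int,
        "long": ctypes.c_long,
        "size_t": ctypes.c_size_t,
        "int64_t": ctypes.c_int64,
        "void*": ctypes.c_void_p,
    }
    return {name: ctypes.sizeof(ctype) for name, ctype in types.items()}


if __name__ == "__main__":
    for name, size in sorted(c_type_sizes().items()):
        print("%-8s %d bytes" % (name, size))
```

On a 32-bit platform such as armv7l, long, size_t and pointers are typically 4 bytes while int64_t stays at 8, so any code that treats these types as interchangeable is a candidate for the mismatch described above.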

robertnishihara on 24 Sep 2017

Another thing to verify is that you can start the plasma store by hand without any trouble. In your case, the command is probably

/usr/local/lib/python2.7/dist-packages/ray-0.2.0-py2.7-linux-armv7l.egg/ray/plasma/../core/src/plasma/plasma_store -s /tmp/s1 -m 1000000

If that works, then try connecting a plasma manager. E.g., check out the instructions in this comment https://github.com/ray-project/ray/issues/108#issuecomment-308572084.

robertnishihara on 24 Sep 2017

I have the same issue on a different platform (Ubuntu 16.04 VM running on Windows 7). I followed the instructions for connecting a plasma manager, and was able to start a plasma store, but when I tried to start a plasma manager, I received a /ray/src/thirdparty/arrow/cpp/src/plasma/io.cc98 Check failed: version == PLASMA_PROTOCOL_VERSION version = 4 error thrown from the plasma store, and /ray/src/plasma/plasma_manager.cc483 Check failed: _s.ok() Bad status: IOError: Broken pipe thrown from the plasma manager. Any advice on how to proceed?

arvindc95 on 4 Oct 2017

@arvindc95 @akzare could you try cherry-picking the commit from apache/arrow#1172, recompiling Arrow, and seeing if it fixes the problem? I just looked through the code in that file and spotted that potential bug.

Let me know if you have questions about how to do this.

If that doesn't work, then I think we'll just need to add a lot of print statements (e.g., in this function https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L90) and print the actual bytes that are being sent and see if we can infer anything from that.
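
The "print the actual bytes" idea can be prototyped offline before touching the C++ code. The sketch below is in Python rather than C++, and it assumes the message header is three little-endian 64-bit integers (version, type, length). That layout is an assumption for illustration and should be checked against io.cc. The names HEADER, hexdump and decode_header are hypothetical helpers, not part of Arrow or Ray.

```python
import struct

# Assumed header layout: version, type, length, each a little-endian int64.
HEADER = struct.Struct("<qqq")


def hexdump(data):
    """Format raw bytes as space-separated hex pairs."""
    return " ".join("%02x" % b for b in bytearray(data))


def decode_header(data):
    """Decode the first HEADER.size bytes of a message into named fields."""
    version, msg_type, length = HEADER.unpack(data[:HEADER.size])
    return {"version": version, "type": msg_type, "length": length}


if __name__ == "__main__":
    sample = HEADER.pack(0, 4, 16)  # made-up values, not captured traffic
    print(hexdump(sample))
    print(decode_header(sample))
```

Comparing a dump of the bytes the sender wrote against what the receiver decoded should show whether the fields are shifted, truncated, or simply different.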

robertnishihara on 5 Oct 2017

@robertnishihara thanks for the help, that commit helped me get ray initialized; I'm able to put and get objects from the plasma store, and use the remote function when there's nothing to be parallelized, but when I try running the time.sleep example in the documentation (http://ray.readthedocs.io/en/latest/tutorial.html#remote-functions), I get a segmentation fault thrown from the local scheduler. Do you have any ideas how I can debug this? Are there log files generated by the scheduler?

arvindc95 on 5 Oct 2017

Glad to hear it, and thanks for trying it out! Sounds like there's a bug in the local scheduler (perhaps similar to the previous bug).

You're rebuilding all of Ray, right? Because the local scheduler also communicates with the plasma store, so it probably needs the same fix from apache/arrow#1172.

Some processes log to /tmp/raylogs, so it's worth looking at the most recent files in there and seeing if anything turns up. If you're starting Ray with ray.init(), though, the local scheduler's STDERR/STDOUT will just go to the terminal.
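
A small helper can list the most recently modified files in the log directory. This is a minimal sketch using only the standard library. The function name newest_files and the default of five files are arbitrary choices, and /tmp/raylogs is the directory mentioned above.

```python
import os


def newest_files(directory, count=5):
    """Return up to `count` regular files in `directory`, newest first."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [path for path in paths if os.path.isfile(path)]
    return sorted(files, key=os.path.getmtime, reverse=True)[:count]


if __name__ == "__main__":
    log_dir = "/tmp/raylogs"
    if os.path.isdir(log_dir):
        for path in newest_files(log_dir):
            print(path)
```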

What I would suggest is trying to run the same workload that is causing the crash, but to start the local scheduler in gdb. To do that, you could do something like the following.

First modify

https://github.com/ray-project/ray/blob/aebe9f937451bfa10aa0f2a41bafcf4747fb60f0/python/ray/local_scheduler/local_scheduler_services.py#L122

to be something like
```
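# Pause here so the local scheduler can be started by hand under gdb.
import IPython
IPython.embed()
# Original launch, disabled while debugging:
# pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
pid = 9999  # placeholder value, since no child process is started here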
```
Then start Python and do import ray and ray.init(). This will open up IPython when it tries to start the local scheduler. Run print(command) in the IPython shell to print the command that Ray wants to use to start the local scheduler.
Then go to a different terminal window, and do
```
gdb ray/python/ray/core/src/local_scheduler/local_scheduler
```
Then type run followed by the command printed by print(command) to start the local scheduler in gdb. However, you'll need to drop the initial executable from the command, and you'll need to add quotes around the full argument to the -w flag, which is pretty long. Otherwise you'll get an error saying unknown flag or something like that.
Then go back to the IPython shell and run exit().
Then run your workload and see what errors are caught in gdb.
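
The step of dropping the executable and quoting the -w argument can be done from the IPython shell instead of by hand. The sketch below assumes that command is a list of strings (the form subprocess.Popen accepts) and that Python 3 is in use, since shlex.quote is not available in Python 2.7 (pipes.quote is the equivalent there). The helper name gdb_run_line and the example paths and flag values are hypothetical.

```python
import shlex


def gdb_run_line(command):
    """Build a gdb `run ...` line from an argv-style list.

    The first element (the executable) is dropped because gdb already
    has it, and every remaining argument is shell-quoted so that long
    arguments containing spaces stay in one piece.
    """
    return "run " + " ".join(shlex.quote(arg) for arg in command[1:])


if __name__ == "__main__":
    # Made-up command for illustration; use the real `command` list instead.
    example = ["/path/to/local_scheduler", "-s", "/tmp/sched",
               "-w", "python worker.py --node-ip=127.0.0.1"]
    print(gdb_run_line(example))
```

The printed line can be pasted at the (gdb) prompt as is.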

Note that if the error is uninformative, we may need to recompile Ray with more debug information. E.g., maybe add a -g to the line

https://github.com/ray-project/ray/blob/aebe9f937451bfa10aa0f2a41bafcf4747fb60f0/src/common/CMakeLists.txt#L9

robertnishihara on 5 Oct 2017

@robertnishihara I tried the steps you outlined for using gdb, but when I tried to run my workload I kept getting an exception when defining a function with the @ray.remote decorator; I've attached the error thrown:
decorator_error.txt

Also, when making the fix you referenced, I made the code change in the arrow code and then reran python setup.py install in order to rebuild Ray. Let me know if this procedure is incomplete for rebuilding Ray (I also ran this after changing local_scheduler_services.py because the IPython shell wasn't showing up)

Thanks again for your help!

arvindc95 on 10 Oct 2017

@arvindc95 interesting, that seems like the same error as #394.

You could try using IPython instead of Python, since #394 was only an issue in the regular Python interpreter.

It's also possible that when you reran python setup.py install, it undid your changes to Arrow. Can you check that your changes were unaffected? Or perhaps comment out this line

https://github.com/ray-project/ray/blob/b1660c4edfcbfec5645103eaa0aad58a255015fa/src/thirdparty/download_thirdparty.sh#L16

Also, instead of using python setup.py install, I'd suggest using python setup.py develop. That way, whenever you change the Python code, you won't need to rerun setup.py; the changes will be picked up automatically.

robertnishihara on 11 Oct 2017

@robertnishihara Using IPython helped; my workload runs successfully, but the debugger throws the following error immediately after the workload completes: Program received signal SIGSEGV, Segmentation fault. __strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50 50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory. Would this segfault be from local_scheduler_services.py or any of the functions it calls?

arvindc95 on 11 Oct 2017

If you do bt in gdb, does that print anything?

This error looks similar https://groups.google.com/forum/#!topic/jansson-users/u78eGC15itw.

cc @atumanov

robertnishihara on 12 Oct 2017

Yes, here's the output:

Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50 ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.
(gdb) bt
#0  __strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
#1  0x080965ce in redisvFormatCommand (target=0xbfffe8b8,
    format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe928 "") at hiredis.c:262
#2  0x0809b91c in redisvAsyncCommand (ac=0x80c83c0, fn=0x0, privdata=0x0,
    format=0x809fd45 "ZADD %b %s %b", ap=0xbfffe920 "\340\316\f\b\036")
    at async.c:654
#3  0x0809b99c in redisAsyncCommand (ac=0x80c83c0, fn=0x0, privdata=0x0,
    format=0x809fd45 "ZADD %b %s %b") at async.c:669
#4  0x0806c621 in RayLogger_log_event (db=0x80c7de0,
    key=0x80ccee0 "event_log:\213\313\363\265O\312\300Է#\206Ɍ\234\274\301\332T\222\004", key_length=30,
    value=0x80cc8e8 "[[1507822835.414047, \"ray:get_task\", 1, {}], [1507822914.265168, \"ray:import_function_to_run\", 1, {}], [1507822914.265763, \"ray:import_function_to_run\", 2, {}], [1507822914.266116, \"ray:import_functio"...,
    value_length=1520, timestamp=1507822920.4063809)
    at /home/achand/ray/src/common/logging.cc:100
#5  0x08056c35 in process_message(aeEventLoop*, int, void*, int) ()
#6  0x0807bcbd in aeProcessEvents (eventLoop=0x80bea38, flags=3)
    at /home/achand/ray/src/common/thirdparty/ae/ae.c:412
#7  0x0807c19b in aeMain (eventLoop=0x80bea38)
    at /home/achand/ray/src/common/thirdparty/ae/ae.c:455
#8  0x0805f8f8 in event_loop_run (loop=0x80bea38)
    at /home/achand/ray/src/common/event_loop.cc:58
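
The thread never states the root cause of this crash, so the following is only one plausible mechanism, sketched under stated assumptions. The crash is in strlen, called while formatting "ZADD %b %s %b". If the length for a %b argument were passed as a 64-bit integer on a 32-bit build where the callee reads a 4-byte size_t, every later argument would be read from the wrong offset, and %s could receive a null pointer. The simulation below models a variadic argument area with the struct module. The function names and the timestamp pointer value are made up, and the 4-byte slot layout is an assumption about a 32-bit calling convention.

```python
import struct


def caller_pushes(key_ptr, key_length, timestamp_ptr, length_format):
    """Pack the arguments the way the caller would lay them out.

    length_format is "q" for a 64-bit length or "I" for a 32-bit one.
    """
    return struct.pack("<I" + length_format + "I",
                       key_ptr, key_length, timestamp_ptr)


def callee_reads(arg_area):
    """Read pointer, 4-byte size, pointer: what "%b %s" would consume."""
    return struct.unpack_from("<III", arg_area)


if __name__ == "__main__":
    key_ptr, key_length = 0x080CCEE0, 30  # values taken from the backtrace
    timestamp_ptr = 0x080CD000            # hypothetical address

    mismatched = callee_reads(caller_pushes(key_ptr, key_length, timestamp_ptr, "q"))
    matched = callee_reads(caller_pushes(key_ptr, key_length, timestamp_ptr, "I"))
    print("64-bit length passed, %%s pointer read as 0x%x" % mismatched[2])
    print("32-bit length passed, %%s pointer read as 0x%x" % matched[2])
```

In the mismatched case the callee sees the upper half of the 64-bit length where it expects the string pointer. Whether this is what actually happened here would need to be confirmed against the code in logging.cc.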

arvindc95 on 12 Oct 2017

It looks like it is using SSE2 instructions which probably aren't available on ARM. Could it be that there is some issue with the (cross-)compilation?

pcmoritz on 13 Oct 2017

@pcmoritz I checked the instruction sets supported in the VM guest and SSE2 is one of them (it's supported in the VM host as well)

arvindc95 on 13 Oct 2017

@arvindc95 I created a PR here: https://github.com/ray-project/ray/pull/1122 Could you try both the commits in the PR and see if one of them makes it work? These are both fixing potential problems here. Thanks!

pcmoritz on 13 Oct 2017

In particular, we'd be interested in knowing which of the two commits fixes it (assuming one of them does in fact fix it).

robertnishihara on 13 Oct 2017

seg_fault_fix_logging.txt
seg_fault_add_casts.txt

Both failed the same way as before; the segfault happened after the results of foo.remote() were returned. I made the logging code change, ran python setup.py develop, then tried running a workload, and repeated the process for the static cast addition as well. The gdb logs show the updated logging code change, and the lines referenced are slightly different between the two logs, so I think the changes were compiled; let me know if I missed anything.

arvindc95 on 13 Oct 2017

Also, I've been manually starting ray because when I don't, the plasma store never initializes. I changed the socket name from /tmp/s1 to /tmp/s2 in case the same socket was being reused every time I manually started the store, but the store was still being initialized, so I'm not sure why it doesn't get made when I don't manually start the store.

arvindc95 on 14 Oct 2017

Hm, thanks for trying it out. Is there any chance you could share your VirtualBox image (or an EC2 AMI, if you have one) with us, together with instructions to reproduce the problem, so we can dig deeper into this?

pcmoritz on 14 Oct 2017

I was able to reproduce this on 32-bit Ubuntu 16.04 and fix it. I put together a quick PR that fixes it for me. Could you please try out https://github.com/ray-project/ray/pull/1126? Thanks.

atumanov on 14 Oct 2017

@atumanov The changes from your PR worked, thanks so much!
Also, thanks to @pcmoritz and @robertnishihara for your help resolving this as well! Would you still like me to post my VirtualBox image?

arvindc95 on 16 Oct 2017

@arvindc95, awesome, glad to hear! The VirtualBox image will be helpful for testing, in case we need to reproduce any other problems you encounter. If you are in a position to provide us with the ODROID platform for testing purposes as well, even better :)

atumanov on 16 Oct 2017

Closing for now since a lot of things have changed.

robertnishihara on 2 Feb 2018
