I am trying to use plasma to store datasets in memory, and share them between pods.
I find that this does not work well, and in particular, plasma.get/plasma.put tends to hang with no specific error message.
I am sure other people have tried this setup, I would love to hear about their experience.
The setup is:
- a plasmaserver pod running plasma_store, with its socket directory shared through a volume, either with the default /dev/shm backing or with hugepages (-d /mnt/hugepages -h)
- two client pods (plasma1 and plasma2) that mount the same socket volume and talk to the store

With shm, get() hangs when retrieving an object submitted by another client; with hugepages, the store complains about missing huge pages when mmapping.

Note that I could get this running using Docker containers just fine. I understand that some of those issues are due to Kubernetes more than plasma, but I would love some pointers.
cc @mitar
Hey @remram44, thanks for bringing this up! Do you have Kubernetes scripts and instructions for setting this up on EC2 so we can reproduce the issue? Any pointers are welcome.
@remram44 how did you get it working between Docker containers? Did you have to do anything special?
On Docker, I didn't have to do anything special; I ran with native Docker on macOS. However, trying this again, it only seems to work if I don't pass an explicit ObjectID to put(); otherwise get() hangs.
Server:
docker run -ti --rm --name plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow plasma_store -s /mnt/socket/plasma -m 10000000
Sender:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.put("hello, world").binary())'
b'\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/'
Getter:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/")))'
hello, world
Explicit sender:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); client.put("hello, world", plasma.ObjectID(b"testidhere"))'
Getter:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"testidhere")))'
<hangs>
I ran this on Kubernetes on Google Cloud with this configuration:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasmaserver
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasmaserver
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000']
        # or
        # command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000 -d /mnt/hugepages -h']
        volumeMounts:
        - mountPath: /mnt/socket
          name: socket
        - mountPath: /mnt/hugepages
          name: hugepages
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
      - name: hugepages
        persistentVolumeClaim:
          claimName: hugepagesvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma1
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma2
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: plasmasocketv
  labels:
    thing: plasmasocket
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: "/var/plasma-rr4"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: plasmasocketvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: plasmasocket
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: hugepagesv
  labels:
    thing: hugepages
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: "/var/hugepages"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: hugepagesvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: hugepages
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
Then I ran commands on plasma1 and plasma2 using kubectl exec.
@remram44 Thanks! The hanging you are seeing is unrelated to using docker. It hangs because ObjectIDs need to be exactly 20 bytes long. So even without docker, this hangs:
In [5]: client.put("hello", plasma.ObjectID(b"hi"))
Out[5]: ObjectID(68690000537f0000300000000000000091010000)
In [6]: client.get(plasma.ObjectID(b"hi"))
Whereas this works:
In [3]: client.put("hello", plasma.ObjectID(20 * b"h"))
Out[3]: ObjectID(6868686868686868686868686868686868686868)
In [4]: client.get(plasma.ObjectID(20*b"h"))
Out[4]: 'hello'
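If you want deterministic IDs derived from a name rather than letting put() pick a random one, one simple approach (just a sketch, nothing Plasma requires) is to hash the name down to exactly 20 bytes; SHA-1 digests happen to be exactly that length:

import hashlib

import pyarrow.plasma as plasma

def object_id_for(name):
    # SHA-1 digests are exactly 20 bytes, which is the length ObjectID expects.
    return plasma.ObjectID(hashlib.sha1(name.encode("utf-8")).digest())

client = plasma.connect("/mnt/socket/plasma", "", 0)
oid = object_id_for("testidhere")
client.put("hello, world", oid)
print(client.get(oid))  # 'hello, world'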
Can you check whether this also fixes the problem in Kubernetes?
This probably should be raising a ValueError 😅 but I agree that it's a separate problem. I'll try again with valid IDs.
I am surprised that your Docker example even works. The Plasma store uses /dev/shm by default to share objects on Linux, but that is not shared between containers in Docker, so your server and client do not have the same /dev/shm. I am not sure how communication would work here?
Hanging on invalid ObjectID is really surprising. :-)
(It is interesting that GitHub highlights the invalid ObjectID with a red background?)
I don't know why it is red :)
I agree it is not good behaviour and should give an error. I submitted a JIRA ticket here and will fix it ASAP: https://issues.apache.org/jira/browse/ARROW-1919
Thanks for finding the problem!
@pcmoritz: Do you understand why sharing works between containers even if /dev/shm is not shared?
I do not understand it and have not tried it, but it seems to be possible to share memory between docker containers in general, see https://stackoverflow.com/questions/29173193/shared-memory-with-docker-containers-docker-version-1-4-1
It seems we would have to use the --ipc argument, but the example above does not use it. This is why I am confused. @remram44, which Docker version are you using? If you go into two Docker containers and create a file in /dev/shm of one, does it appear in the other?
Also, @pcmoritz, is /dev/shm being used by Plasma store or is memory sharing done in some other way?
By default it uses /dev/shm on Linux and /tmp/ on macOS, and it can be configured to use another location with the -d flag.
What does it store there? Does it store whole objects and then mmap them? Because /tmp is not in memory on macOS.
We had the same suspicion and did performance experiments with this; it behaves very much like it is in memory. We actually unlink the file before writing anything, so maybe that prevents flushing to disk. This is the same strategy Google Chrome uses for its shared memory.
Do both containers have to have access to the same /dev/shm or do you send a file descriptor over the socket? Does /dev/shm have to be large (larger than -m parameter)?
The file descriptor is sent over the socket. That's a good point; that is probably what makes it work. And yes, /dev/shm needs to be larger than the -m parameter, otherwise an error is raised, see https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L810.
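To illustrate the mechanism being described (only a sketch of the general technique, not Plasma's actual code), a process can create a file in /dev/shm, unlink it right away, and hand the still-open file descriptor to another process over a Unix socket; the receiver can then mmap the same memory:

import mmap
import os
import socket

# Two ends of a Unix socket; in Plasma this role is played by the store's client socket.
sender_sock, receiver_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Store" side: create a file in /dev/shm, unlink it immediately, size it, and map it.
path = "/dev/shm/fd-passing-demo"
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.unlink(path)              # the name is gone, but the open fd keeps the memory alive
os.ftruncate(fd, 4096)
store_view = mmap.mmap(fd, 4096)
store_view[:5] = b"hello"

# Send the open file descriptor over the socket (Python 3.9+;
# Plasma does the equivalent with sendmsg/SCM_RIGHTS in C++).
socket.send_fds(sender_sock, [b"x"], [fd])

# "Client" side: receive the fd and map the same physical memory.
_, fds, _, _ = socket.recv_fds(receiver_sock, 1, 1)
client_view = mmap.mmap(fds[0], 4096)
print(client_view[:5])       # b'hello'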
OK, the above uses -m 10000000, which is around 10 MB and therefore less than the 64 MB default /dev/shm in Docker.
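One quick way to confirm this from inside a container is to compare the size of the filesystem backing /dev/shm with the value you plan to pass to -m (a small sketch using only the standard library):

import os

st = os.statvfs("/dev/shm")
shm_bytes = st.f_frsize * st.f_blocks    # total size of the tmpfs backing /dev/shm
plasma_memory = 10_000_000               # the value passed to plasma_store -m
print(shm_bytes, "bytes available in /dev/shm")
print("OK" if shm_bytes >= plasma_memory else "too small: raise the shm size or lower -m")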
Yeah, I would also suspect so. So I would assume the object is stored in /dev/shm of the Docker container which created it, and others then just access it over the file descriptor. I think we should test what happens if the container which created the object is stopped, and who is responsible for cleaning up file descriptors.
The beauty here is that the OS does refcounting on the file descriptors and will release the resources when the last reference goes away. That's why we went through the pain of making file descriptor sending work and unlinking the original file; the combination of these makes sure there is no garbage left behind.
I'm not sure what happens in the Docker container case, however: does the host OS do the refcounting so that everything magically works? I don't know, but I suspect so. Let me know if you plan to look into this!
Disregarding the details, I'm extremely happy to learn that it works in the case of multiple docker containers. That's really great :)
So in the future we could use Docker to get isolation between workers! And if the objects in the object store are Arrow data rather than pickled Python objects (pickle could be deactivated), it might even be possible to get some level of security out of this, if you trust Docker's isolation.
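As a sketch of what that could look like with the pyarrow.plasma API (the socket path and object ID here are placeholders), a record batch can be written straight into a buffer created by the store and read back by any client without pickling:

import pyarrow as pa
import pyarrow.plasma as plasma

client = plasma.connect("/mnt/socket/plasma", "", 0)
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])

# Measure the serialized size with a mock sink.
mock_sink = pa.MockOutputStream()
writer = pa.RecordBatchStreamWriter(mock_sink, batch.schema)
writer.write_batch(batch)
writer.close()
size = mock_sink.size()

# Create a buffer owned by the store, write the batch into it, and seal it.
object_id = plasma.ObjectID(20 * b"b")
buf = client.create(object_id, size)
writer = pa.RecordBatchStreamWriter(pa.FixedSizeBufferWriter(buf), batch.schema)
writer.write_batch(batch)
writer.close()
client.seal(object_id)

# Any client connected to the same store can now read the data without pickling.
[data] = client.get_buffers([object_id])
reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
print(reader.read_next_batch().to_pydict())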
An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet (kubernetes-28272).
OK, so again running on GKE, I could get plasma to run just fine with shm (staying under the default Docker size of 64 MB, and using 20-byte object IDs), but no luck with hugepages. Kubernetes support for hugepages seems to be upcoming (alpha in 1.8; see here).
Can I use the -d option without -h to specify an alternate location for /dev/shm? So I can provide a bigger shm as a volume?
Mounting a bigger shm from the host, either as /dev/shm in the container or somewhere else and using -d to point to it, allows me to use a -m value bigger than 64 MB (as per this openshift workaround).
So I guess plasma is usable on Docker and Kubernetes after all, just no hugepages?
Allowing the Plasma store to use up to 0.01GB of memory.
Starting object store with directory /mnt/hugepages and huge page support enabled
mmap failed with error: Cannot allocate memory
(this probably means you have to increase /proc/sys/vm/nr_hugepages)
mmap failed with error: Cannot allocate memory
(this probably means you have to increase /proc/sys/vm/nr_hugepages)
...
mmap failed with error: Cannot allocate memory
(this probably means you have to increase /proc/sys/vm/nr_hugepages)
There is not enough space to create this object, so evicting 0 objects to free up 0 bytes.
Disconnecting client on fd 5
@remram44, in the log above your plasma store is starting with 0.01 GB, i.e. 10 MB. Hugepages in the plasma store only start working with a minimum memory allocation of 1 GB:
https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L820
@remram44 , if you are sure you are dealing with 2MB hugepages, you could try overriding that 1GB default with, say 10MB instead, to fit your memory configuration.
You mean that plasma doesn't work with hugepages if the -m value is below 1GB?
Yes, I believe that's correct, but it's a one-line change. We could log an error message on startup if the specified -m value is < 1 GB when -h is also specified. The 1 GB threshold is not fundamental; it's a safe default that works for both 2 MB and 1 GB pages. With 2 MB hugepages being the more popular/widespread option, that default could be changed, but we felt 1 GB would be a more robust out-of-the-box default when the hugepage size on the target platform is unknown.
Same error running with -m 2000000000 -h unfortunately.
@remram44, did you set up the mount point inside the Docker containers to be backed by hugetlbfs? I'm not sure if you've gone through the process of setting up the mount point; here's the link:
http://ray.readthedocs.io/en/latest/plasma-object-store.html
Things to check:
- Is the directory passed with -d visible inside the container and backed by hugetlbfs? You should be able to touch files in there.
- cat /proc/sys/vm/hugetlb_shm_group
- cat /proc/sys/vm/nr_hugepages

All of this -- inside the docker container running the plasma store. I haven't tried it in the docker container, so it's not officially supported, but let's see if we can make it work together :)
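For convenience, the same checks can be scripted from inside the container running the store (a small sketch; the /proc paths are standard Linux):

# Quick checks for hugepage support inside the container running plasma_store.
with open("/proc/sys/vm/nr_hugepages") as f:
    print("nr_hugepages:", f.read().strip())

with open("/proc/sys/vm/hugetlb_shm_group") as f:
    print("hugetlb_shm_group:", f.read().strip())

# Is the -d directory actually backed by hugetlbfs?
with open("/proc/mounts") as f:
    mounts = [line.strip() for line in f if "hugetlbfs" in line]
print("hugetlbfs mounts:", mounts or "none found")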
An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet
This is just one more reason why we should use huge pages instead of /dev/shm.
Using emptyDir with medium: Memory seems reasonable. But how do you configure the size of the volume? Or is it just unlimited (all memory) unless specified? How large does it appear if you look at its size manually?
Can you use emptyDir across pods? Or is it not necessary and file descriptor sharing works?
@remram44 @mitar, this was a while ago, but how did you end up resolving this? Were you able to get something working with shared memory between pods?
Please reopen if there are more questions/updates.
My team is interested in the possibility of using plasma as a way of transferring data between pods - @remram44 @mitar just checking to see if you ever got this working?
Yes, it works well. We just have a host-local directory that we mount into all pods and use for the plasma socket between them.
I have not yet found a good solution for configuring this host-local directory in a scalable way, though, if you want your pods to run on multiple nodes. Some notes I wrote about this:
There seem to be two ways to achieve this:
@mitar thank you for the detailed response :)