I am trying to use plasma to store datasets in memory, and share them between pods.
I find that this does not work well, and in particular, plasma.get/plasma.put tends to hang with no specific error message.
I am sure other people have tried this setup, I would love to hear about their experience.
The setup is:
- a plasmaserver pod running plasma_store, with its socket directory shared through a volume, either with the default /dev/shm backing or with hugepages (-d /mnt/hugepages -h)
- two client pods (plasma1 and plasma2) that mount the same socket volume and talk to the store

With shm, get() hangs when retrieving an object submitted by another client; with hugepages, the store complains about missing huge pages when mmapping.

Note that I could get this running using Docker containers just fine. I understand that some of those issues are due to Kubernetes more than plasma, but I would love some pointers.
cc @mitar
Hey @remram44, thanks for bringing this up! Do you have Kubernetes scripts and instructions for setting this up on EC2 so we can reproduce the issue? Any pointers are welcome.
@remram44 how did you get it working between Docker containers? Did you have to do anything special?
On Docker, I didn't have to do anything special; I ran with native Docker on macOS. However, trying this again, it only seems to work if I don't pass an explicit ObjectID to put(); otherwise get() hangs.
Server:
docker run -ti --rm --name plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow plasma_store -s /mnt/socket/plasma -m 10000000
Sender:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.put("hello, world").binary())'
b'\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/'
Getter:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/")))'
hello, world
Explicit sender:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); client.put("hello, world", plasma.ObjectID(b"testidhere"))'
Getter:
docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"testidhere")))'
<hangs>
I ran this on Kubernetes on Google Cloud with this configuration:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasmaserver
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasmaserver
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000']
        # or
        # command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000 -d /mnt/hugepages -h']
        volumeMounts:
        - mountPath: /mnt/socket
          name: socket
        - mountPath: /mnt/hugepages
          name: hugepages
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
      - name: hugepages
        persistentVolumeClaim:
          claimName: hugepagesvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma1
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma2
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: plasmasocketv
  labels:
    thing: plasmasocket
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: "/var/plasma-rr4"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: plasmasocketvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: plasmasocket
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: hugepagesv
  labels:
    thing: hugepages
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: "/var/hugepages"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: hugepagesvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: hugepages
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
Then I ran commands on plasma1 and plasma2 using kubectl exec.
@remram44 Thanks! The hanging you are seeing is unrelated to using docker. It hangs because ObjectIDs need to be exactly 20 bytes long. So even without docker, this hangs:
In [5]: client.put("hello", plasma.ObjectID(b"hi"))
Out[5]: ObjectID(68690000537f0000300000000000000091010000)
In [6]: client.get(plasma.ObjectID(b"hi"))
Whereas this works:
In [3]: client.put("hello", plasma.ObjectID(20 * b"h"))
Out[3]: ObjectID(6868686868686868686868686868686868686868)
In [4]: client.get(plasma.ObjectID(20*b"h"))
Out[4]: 'hello'
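If you want deterministic IDs derived from a name rather than letting put() pick a random one, one simple approach (just a sketch, nothing Plasma requires) is to hash the name down to exactly 20 bytes; SHA-1 digests happen to be exactly that length:

import hashlib

import pyarrow.plasma as plasma

def object_id_for(name):
    # SHA-1 digests are exactly 20 bytes, which is the length ObjectID expects.
    return plasma.ObjectID(hashlib.sha1(name.encode("utf-8")).digest())

client = plasma.connect("/mnt/socket/plasma", "", 0)
oid = object_id_for("testidhere")
client.put("hello, world", oid)
print(client.get(oid))  # 'hello, world'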
Can you check whether this also fixes the problem in Kubernetes?
This probably should be raising a ValueError 😅 but I agree that it's a separate problem. I'll try again with valid IDs.
I am surprised that your Docker example even works. The Plasma store uses /dev/shm by default to share objects on Linux, but that is not shared between containers in Docker, so your server and client do not have the same /dev/shm. I am not sure how communication would work here?
Hanging on invalid ObjectID is really surprising. :-)
(It is interesting that GitHub highlights the invalid ObjectID with a red background?)
I don't know why it is red :)
I agree it is not good behaviour and should give an error. I submitted a JIRA ticket here and will fix it ASAP: https://issues.apache.org/jira/browse/ARROW-1919
Thanks for finding the problem!
@pcmoritz: Do you understand why sharing works between containers even if /dev/shm is not shared?
I do not understand it and have not tried it, but it seems to be possible to share memory between docker containers in general, see https://stackoverflow.com/questions/29173193/shared-memory-with-docker-containers-docker-version-1-4-1
It seems we would have to use the --ipc argument, but the example above does not use it. This is why I am confused. @remram44, which Docker version are you using? If you go into two Docker containers and create a file in /dev/shm of one, does it appear in the other?
Also, @pcmoritz, is /dev/shm being used by Plasma store or is memory sharing done in some other way?
By default it uses /dev/shm on Linux and /tmp/ on macOS, and it can be configured to use another location with the -d flag.
What does it store there? Does it store whole objects and then mmap them? Because /tmp is not in memory on macOS.
We had the same suspicion and did performance experiments with this; it behaves very much like it is in memory. We actually unlink the file before writing anything, so maybe that prevents flushing to disk. This is the same strategy Google Chrome uses for its shared memory.
Do both containers have to have access to the same /dev/shm or do you send a file descriptor over the socket? Does /dev/shm have to be large (larger than -m parameter)?
The file descriptor is sent over the socket. That's a good point; that is probably what makes it work. And yes, /dev/shm needs to be larger than the -m parameter, otherwise an error is raised, see https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L810.
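To illustrate the mechanism being described (only a sketch of the general technique, not Plasma's actual code), a process can create a file in /dev/shm, unlink it right away, and hand the still-open file descriptor to another process over a Unix socket; the receiver can then mmap the same memory:

import mmap
import os
import socket

# Two ends of a Unix socket; in Plasma this role is played by the store's client socket.
sender_sock, receiver_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Store" side: create a file in /dev/shm, unlink it immediately, size it, and map it.
path = "/dev/shm/fd-passing-demo"
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.unlink(path)              # the name is gone, but the open fd keeps the memory alive
os.ftruncate(fd, 4096)
store_view = mmap.mmap(fd, 4096)
store_view[:5] = b"hello"

# Send the open file descriptor over the socket (Python 3.9+;
# Plasma does the equivalent with sendmsg/SCM_RIGHTS in C++).
socket.send_fds(sender_sock, [b"x"], [fd])

# "Client" side: receive the fd and map the same physical memory.
_, fds, _, _ = socket.recv_fds(receiver_sock, 1, 1)
client_view = mmap.mmap(fds[0], 4096)
print(client_view[:5])       # b'hello'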
OK, the above uses -m 10000000, which is around 10 MB and therefore less than the 64 MB default /dev/shm in Docker.
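One quick way to confirm this from inside a container is to compare the size of the filesystem backing /dev/shm with the value you plan to pass to -m (a small sketch using only the standard library):

import os

st = os.statvfs("/dev/shm")
shm_bytes = st.f_frsize * st.f_blocks    # total size of the tmpfs backing /dev/shm
plasma_memory = 10_000_000               # the value passed to plasma_store -m
print(shm_bytes, "bytes available in /dev/shm")
print("OK" if shm_bytes >= plasma_memory else "too small: raise the shm size or lower -m")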
Yeah, I would also suspect so. So I would assume the object is stored in /dev/shm of the Docker container which created it, and others then just access it over the file descriptor. I think we should test what happens if the container which created the object is stopped, and who is responsible for cleaning up file descriptors.
The beauty here is that the OS does refcounting on the file descriptors and will release the resources when the last reference goes away. That's why we went through the pain of making file descriptor sending work and unlinking the original file; the combination of these makes sure there is no garbage left behind.
I'm not sure what happens in the Docker container case, however: does the host OS do the refcounting so that everything magically works? I don't know, but I suspect so. Let me know if you plan to look into this!
Disregarding the details, I'm extremely happy to learn that it works in the case of multiple docker containers. That's really great :)
So in the future we could use Docker to get isolation between workers! And if the objects in the object store are Arrow data rather than pickled Python objects (pickle could be deactivated), it might even be possible to get some level of security out of this, if you trust Docker's isolation.
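As a sketch of what that could look like with the pyarrow.plasma API (the socket path and object ID here are placeholders), a record batch can be written straight into a buffer created by the store and read back by any client without pickling:

import pyarrow as pa
import pyarrow.plasma as plasma

client = plasma.connect("/mnt/socket/plasma", "", 0)
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])

# Measure the serialized size with a mock sink.
mock_sink = pa.MockOutputStream()
writer = pa.RecordBatchStreamWriter(mock_sink, batch.schema)
writer.write_batch(batch)
writer.close()
size = mock_sink.size()

# Create a buffer owned by the store, write the batch into it, and seal it.
object_id = plasma.ObjectID(20 * b"b")
buf = client.create(object_id, size)
writer = pa.RecordBatchStreamWriter(pa.FixedSizeBufferWriter(buf), batch.schema)
writer.write_batch(batch)
writer.close()
client.seal(object_id)

# Any client connected to the same store can now read the data without pickling.
[data] = client.get_buffers([object_id])
reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
print(reader.read_next_batch().to_pydict())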
An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet (kubernetes-28272).
OK, so again running on GKE, I could get plasma to run just fine with shm (staying under the default Docker size of 64 MB, and using 20-byte object IDs), but no luck with hugepages. Kubernetes support for hugepages seems to be upcoming (alpha in 1.8; see here).
Can I use the -d option without -h to specify an alternate location for /dev/shm? So I can provide a bigger shm as a volume?
Mounting a bigger shm from the host, either as /dev/shm in the container or somewhere else and using -d to point to it, allows me to use a -m value bigger than 64 MB (as per this openshift workaround).
So I guess plasma is usable on Docker and Kubernetes after all, just no hugepages?
Allowing the Plasma store to use up to 0.01GB of memory.
Starting object store with directory /mnt/hugepages and huge page support enabled
mmap failed with error: Cannot allocate memory
(this probably means you have to increase /proc/sys/vm/nr_hugepages)
mmap failed with error: Cannot allocate memory
(this probably means you have to increase /proc/sys/vm/nr_hugepages)
...
mmap failed with error: Cannot allocate memory
(this probably means you have to increase /proc/sys/vm/nr_hugepages)
There is not enough space to create this object, so evicting 0 objects to free up 0 bytes.
Disconnecting client on fd 5
@remram44, in the log above your plasma store is starting with 0.01 GB, i.e. 10 MB. Hugepages in the plasma store only start working with a minimum memory allocation of 1 GB:
https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L820
@remram44 , if you are sure you are dealing with 2MB hugepages, you could try overriding that 1GB default with, say 10MB instead, to fit your memory configuration.
You mean that plasma doesn't work with hugepages if the -m value is below 1GB?
Yes, I believe that's correct, but it's a one-line change. We could log an error message on startup if the specified -m value is < 1 GB when -h is also specified. The 1 GB threshold is not fundamental; it's a safe default that works for both 2 MB and 1 GB pages. With 2 MB hugepages being the more popular/widespread option, that default could be changed, but we felt 1 GB would be a more robust out-of-the-box default when the hugepage size on the target platform is unknown.
Same error running with -m 2000000000 -h unfortunately.
@remram44, did you set up the mount point inside the Docker containers to be backed by hugetlbfs? I'm not sure if you've gone through the process of setting up the mount point; here's the link:
http://ray.readthedocs.io/en/latest/plasma-object-store.html
Things to check:
- Is the directory passed with -d visible inside the container and backed by hugetlbfs? You should be able to touch files in there.
- cat /proc/sys/vm/hugetlb_shm_group
- cat /proc/sys/vm/nr_hugepages

All of this -- inside the docker container running the plasma store. I haven't tried it in the docker container, so it's not officially supported, but let's see if we can make it work together :)
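For convenience, the same checks can be scripted from inside the container running the store (a small sketch; the /proc paths are standard Linux):

# Quick checks for hugepage support inside the container running plasma_store.
with open("/proc/sys/vm/nr_hugepages") as f:
    print("nr_hugepages:", f.read().strip())

with open("/proc/sys/vm/hugetlb_shm_group") as f:
    print("hugetlb_shm_group:", f.read().strip())

# Is the -d directory actually backed by hugetlbfs?
with open("/proc/mounts") as f:
    mounts = [line.strip() for line in f if "hugetlbfs" in line]
print("hugetlbfs mounts:", mounts or "none found")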
An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet
This is just one more reason why we should use huge pages instead of /dev/shm.
Using emptyDir with medium: Memory seems reasonable. But how do you configure the size of the volume? Or is it just unlimited (all memory) unless specified? How large does it appear if you look at its size manually?
Can you use emptyDir across pods? Or is it not necessary and file descriptor sharing works?
@remram44 @mitar, this was a while ago, but how did you end up resolving this? Were you able to get something working with shared memory between pods?
Please reopen if there are more questions/updates.
My team is interested in the possibility of using plasma as a way of transferring data between pods - @remram44 @mitar just checking to see if you ever got this working?
Yes, it works well. We just have a host-local directory that we mount into all pods and use for the plasma socket between them.
I have not yet found a good solution for configuring this host-local directory in a scalable way, though, if you want your pods to run on multiple nodes. Some notes I wrote about this:
There seem to be two ways to achieve this:
@mitar thank you for the detailed response :)