We ran performance tests with Rook on AWS.
We used 6 i3.4xlarge instances for Ceph storage (each with 2 x 1900 GB NVMe SSDs) and the following cluster config:
```yaml
apiVersion: rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook
spec:
  backend: ceph
  dataDirHostPath: /var/lib/rook
  hostNetwork: false
  monCount: 3
  storage:                 # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "nvme[0-9]n1"
    metadataDevice:
    location:
    storeConfig:
      storeType: bluestore
```
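For reference, the OSD layout produced by the device filter can be checked from the Rook toolbox; something along these lines (the toolbox pod name and namespace are assumptions and depend on how the toolbox was deployed):

```sh
# Confirm the deviceFilter matched both NVMe drives on every node
# (pod and namespace names are assumptions; adjust to your toolbox deployment)
kubectl -n rook exec -it rook-tools -- ceph osd tree
kubectl -n rook exec -it rook-tools -- ceph status
```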
We used the following command:

```sh
while true ; do
  echo -n `date +%d-%m-%Y-%H:%M:%S`; echo
  fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 \
      --ioengine=libaio --bs=4k --rwmixread=50 --iodepth=16 --numjobs=16 \
      --runtime=5 --group_reporting --name=4ktest --size=128M \
      | egrep "(read[ ]*:|write[ ]*:)"
  echo
  sleep 1
done
```
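To make sure the I/O actually lands on the volume under test rather than the container filesystem, the same run can be pointed at the mount path explicitly; a sketch, assuming the volume is mounted at /mnt/test inside the pod (the path is an assumption):

```sh
# Same 4k random read/write mix, targeted at the mounted test volume
# (/mnt/test is an assumed mount path inside the test pod)
fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 \
    --ioengine=libaio --bs=4k --rwmixread=50 --iodepth=16 --numjobs=16 \
    --runtime=5 --group_reporting --name=4ktest --size=128M \
    --directory=/mnt/test
```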
When we use this pool:

```yaml
apiVersion: rook.io/v1alpha1
kind: Pool
metadata:
  name: replicapool
spec:
  failureDomain: osd
  crushRoot: default
  replicated:
    size: 2
```
and mount the RBD device into a pod, we get good performance:

```
read: IOPS=19.6k, BW=76.4MiB/s (80.1MB/s)(382MiB/5004msec)
[...]
write: IOPS=19.5k, BW=76.1MiB/s (79.8MB/s)(381MiB/5004msec)
```
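For reference, an RBD-backed pool like this is typically consumed through a StorageClass and PVC; a minimal sketch, assuming the rook.io/block provisioner from that Rook generation (the provisioner string and object names are assumptions and depend on the Rook release):

```yaml
# Sketch of a StorageClass/PVC pair backed by the replicapool above.
# The provisioner string and names are assumptions; adjust to your Rook release.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-block
provisioner: rook.io/block
parameters:
  pool: replicapool
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-test-pvc
spec:
  storageClassName: rook-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```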
When we use a filesystem:

```yaml
apiVersion: rook.io/v1alpha1
kind: Filesystem
metadata:
  name: file-fs
spec:
  metadataPool:
    replicated:
      size: 2
  dataPools:
    - replicated:
        size: 2
  metadataServer:
    activeCount: 3
    activeStandby: true
```
we see a big drop in performance:

```
read: IOPS=6374, BW=24.9MiB/s (26.1MB/s)(125MiB/5002msec)
[...]
write: IOPS=6430, BW=25.1MiB/s (26.3MB/s)(126MiB/5002msec)
```
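For reference, the shared filesystem is mounted into the test pod with a flexVolume along these lines (the driver string, namespace, and option names are assumptions and vary between Rook releases):

```yaml
# Sketch of a pod volume mounting the file-fs filesystem via the Rook flexVolume.
# Driver string and clusterNamespace are assumptions; adjust to your Rook release.
volumes:
  - name: file-fs
    flexVolume:
      driver: rook.io/rook
      fsType: ceph
      options:
        fsName: file-fs
        clusterNamespace: rook
```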
Is this performance drop inevitable for a shared filesystem, or can we somehow avoid it? How can we speed up CephFS?
I think you would need to do far more testing for your specific use case to determine if there is an actual 3x performance degradation. In this particular test you are trying to keep 16 threads 16 deep in the queue doing small random I/Os to a single file. Is that actually the IO pattern of your workload?
Without doing any analysis, I'd guess what you are seeing here is the performance difference between the Ceph RBD kernel client and CephFS mounted via FUSE (I believe Rook uses ceph-fuse to mount CephFS). RBD uses its own caching implementation since it can't use the page cache and is probably getting a lot more cache hits. FUSE can have performance issues with small random reads due to its single-threaded userspace daemon.
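One way to separate FUSE overhead from CephFS itself is to mount the same filesystem with the kernel client on a node and re-run the fio loop there; a rough sketch (the monitor address and admin key are placeholders you would pull from the Rook cluster):

```sh
# Kernel-client CephFS mount for comparison against the ceph-fuse mount.
# <mon-ip> and <admin-key> are placeholders; take them from the running cluster.
# mds_namespace selects the filesystem; depending on kernel version it may be
# unnecessary (single filesystem) or spelled differently.
sudo mkdir -p /mnt/cephfs
sudo mount -t ceph <mon-ip>:6789:/ /mnt/cephfs \
    -o name=admin,secret=<admin-key>,mds_namespace=file-fs
```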
Generally speaking, you'll see somewhat better performance from RBD vs CephFS because with RBD all of the file system metadata is managed at the client side, whereas most CephFS metadata updates require a round trip to the MDS (it's a shared file system).
Whether it's inevitable depends on your workload. For things like streaming writes you should see similar throughput. What are you using to generate those IOPS numbers? Is that fio?
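As a quick check of the streaming-write case, a large sequential write run could look like this (a sketch; the directory path is a placeholder for wherever the RBD or CephFS volume is mounted):

```sh
# Sequential 1M writes to compare streaming throughput between the RBD and CephFS mounts
# (--directory is a placeholder for the actual mount path)
fio --direct=1 --rw=write --ioengine=libaio --bs=1M --iodepth=16 --numjobs=1 \
    --runtime=30 --group_reporting --name=seqwrite --size=2G \
    --directory=/mnt/test
```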
^ Listen to that guy