We ran performance tests with Rook on AWS.
We used 6 i3.4xlarge instances for Ceph storage (each with 2 x 1900 GB NVMe SSDs) and the following cluster config:
```yaml
apiVersion: rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook
spec:
  backend: ceph
  dataDirHostPath: /var/lib/rook
  hostNetwork: false
  monCount: 3
  storage:                 # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "nvme[0-9]n1"
    metadataDevice:
    location:
    storeConfig:
      storeType: bluestore
```
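For reference, the OSD layout produced by the device filter can be checked from the Rook toolbox; something along these lines (the toolbox pod name and namespace are assumptions and depend on how the toolbox was deployed):

```sh
# Confirm the deviceFilter matched both NVMe drives on every node
# (pod and namespace names are assumptions; adjust to your toolbox deployment)
kubectl -n rook exec -it rook-tools -- ceph osd tree
kubectl -n rook exec -it rook-tools -- ceph status
```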
We used the following command:

```sh
while true ; do
  echo -n `date +%d-%m-%Y-%H:%M:%S`; echo
  fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 \
      --ioengine=libaio --bs=4k --rwmixread=50 --iodepth=16 --numjobs=16 \
      --runtime=5 --group_reporting --name=4ktest --size=128M \
      | egrep "(read[ ]*:|write[ ]*:)"
  echo
  sleep 1
done
```
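To make sure the I/O actually lands on the volume under test rather than the container filesystem, the same run can be pointed at the mount path explicitly; a sketch, assuming the volume is mounted at /mnt/test inside the pod (the path is an assumption):

```sh
# Same 4k random read/write mix, targeted at the mounted test volume
# (/mnt/test is an assumed mount path inside the test pod)
fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 \
    --ioengine=libaio --bs=4k --rwmixread=50 --iodepth=16 --numjobs=16 \
    --runtime=5 --group_reporting --name=4ktest --size=128M \
    --directory=/mnt/test
```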
When we use this pool:

```yaml
apiVersion: rook.io/v1alpha1
kind: Pool
metadata:
  name: replicapool
spec:
  failureDomain: osd
  crushRoot: default
  replicated:
    size: 2
```
and mount the RBD device into a pod, we get good performance:

```
read: IOPS=19.6k, BW=76.4MiB/s (80.1MB/s)(382MiB/5004msec)
[...]
write: IOPS=19.5k, BW=76.1MiB/s (79.8MB/s)(381MiB/5004msec)
```
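For reference, an RBD-backed pool like this is typically consumed through a StorageClass and PVC; a minimal sketch, assuming the rook.io/block provisioner from that Rook generation (the provisioner string and object names are assumptions and depend on the Rook release):

```yaml
# Sketch of a StorageClass/PVC pair backed by the replicapool above.
# The provisioner string and names are assumptions; adjust to your Rook release.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-block
provisioner: rook.io/block
parameters:
  pool: replicapool
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-test-pvc
spec:
  storageClassName: rook-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```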
When we use a filesystem:

```yaml
apiVersion: rook.io/v1alpha1
kind: Filesystem
metadata:
  name: file-fs
spec:
  metadataPool:
    replicated:
      size: 2
  dataPools:
    - replicated:
        size: 2
  metadataServer:
    activeCount: 3
    activeStandby: true
```
we see a big drop in performance:

```
read: IOPS=6374, BW=24.9MiB/s (26.1MB/s)(125MiB/5002msec)
[...]
write: IOPS=6430, BW=25.1MiB/s (26.3MB/s)(126MiB/5002msec)
```
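For reference, the shared filesystem is mounted into the test pod with a flexVolume along these lines (the driver string, namespace, and option names are assumptions and vary between Rook releases):

```yaml
# Sketch of a pod volume mounting the file-fs filesystem via the Rook flexVolume.
# Driver string and clusterNamespace are assumptions; adjust to your Rook release.
volumes:
  - name: file-fs
    flexVolume:
      driver: rook.io/rook
      fsType: ceph
      options:
        fsName: file-fs
        clusterNamespace: rook
```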
Is this performance drop inevitable for a shared filesystem, or can we somehow avoid it? How can we speed up CephFS?
I think you would need to do far more testing for your specific use case to determine if there is an actual 3x performance degradation. In this particular test you are trying to keep 16 threads 16 deep in the queue doing small random I/Os to a single file. Is that actually the IO pattern of your workload?
Without doing any analysis, I'd guess what you are seeing here is the performance difference between the Ceph RBD kernel client and CephFS mounted via FUSE (I believe Rook uses ceph-fuse to mount CephFS). RBD uses its own caching implementation since it can't use the page cache and is probably getting a lot more cache hits. FUSE can have performance issues with small random reads due to its single-threaded userspace daemon.
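One way to separate FUSE overhead from CephFS itself is to mount the same filesystem with the kernel client on a node and re-run the fio loop there; a rough sketch (the monitor address and admin key are placeholders you would pull from the Rook cluster):

```sh
# Kernel-client CephFS mount for comparison against the ceph-fuse mount.
# <mon-ip> and <admin-key> are placeholders; take them from the running cluster.
# mds_namespace selects the filesystem; depending on kernel version it may be
# unnecessary (single filesystem) or spelled differently.
sudo mkdir -p /mnt/cephfs
sudo mount -t ceph <mon-ip>:6789:/ /mnt/cephfs \
    -o name=admin,secret=<admin-key>,mds_namespace=file-fs
```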
Generally speaking, you'll see somewhat better performance from RBD vs CephFS because with RBD all of the file system metadata is managed at the client side, whereas most CephFS metadata updates require a round trip to the MDS (it's a shared file system).
Whether it's inevitable depends on your workload. For things like streaming writes you should see similar throughput. What are you using to generate those IOPS numbers? Is that fio?
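As a quick check of the streaming-write case, a large sequential write run could look like this (a sketch; the directory path is a placeholder for wherever the RBD or CephFS volume is mounted):

```sh
# Sequential 1M writes to compare streaming throughput between the RBD and CephFS mounts
# (--directory is a placeholder for the actual mount path)
fio --direct=1 --rw=write --ioengine=libaio --bs=1M --iodepth=16 --numjobs=1 \
    --runtime=30 --group_reporting --name=seqwrite --size=2G \
    --directory=/mnt/test
```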
^ Listen to that guy