Currently, HashiCorp Nomad implements a subset of the CSI spec found here. This feature request is for the implementation of the volume-related methods.
At the bottom of this feature request I have included an overview of the CSI spec methods as well as the ones Nomad has currently implemented.
Ceph is an open-source distributed storage system that is widely adopted by enterprises and valued for its scalability and performance.
Implementing the missing CSI spec methods would enable the use of the ceph-csi driver in Nomad to dynamically provision persistent volumes hosted on Ceph for Nomad workloads.
This would bring Nomad on par with other container orchestrators (COs) such as Kubernetes and Mesos.
Community members have already attempted to get the ceph-csi <==> Nomad communication to work.
The problem boils down to Nomad not having fully implemented the CSI spec methods for all the Controller and Node capabilities that the ceph-csi driver (and others) offer. The ceph-csi driver expects volumes to be dynamically provisioned and mounted on the fly (see here), rather than created beforehand and explicitly declared in, e.g., the Nomad Volume Specification.
The ceph-csi driver currently implements the following CSI Spec capabilities:
Note: The CSI methods for RBD and CephFS may differ in implementation, but adhere to the same CSI specification.
| Ceph component | CSI Type | CSI Capabilities | Related CSI Methods | Implemented by Nomad? |
|-------- |-------------- |------ | ------- | ------- |
| RBD | Controller | CREATE_DELETE_VOLUME | CreateVolume | :x: |
| | | | DeleteVolume | :x: |
| RBD | Controller | CREATE_DELETE_SNAPSHOT | CreateSnapshot | :x: |
| | | | DeleteSnapshot | :x: |
| RBD | Controller | CLONE_VOLUME | CreateVolume | :x: |
| RBD | Controller | EXPAND_VOLUME | ControllerExpandVolume | :x: |
| RBD | Node | STAGE_UNSTAGE_VOLUME | NodeStageVolume | :heavy_check_mark: |
| | | | NodeUnstageVolume | :heavy_check_mark: |
| RBD | Node | GET_VOLUME_STATS | NodeGetVolumeStats | :x: |
| RBD | Node | EXPAND_VOLUME | NodeExpandVolume | :x: |

| Ceph component | CSI Type | CSI Capabilities | Related CSI Methods | Implemented by Nomad? |
|-------- |-------------- |------ | ------- | ------- |
| CephFS | Controller | CREATE_DELETE_VOLUME | CreateVolume | :x: |
| | | | DeleteVolume | :x: |
| CephFS | Controller | EXPAND_VOLUME | ControllerExpandVolume | :x: |
| CephFS | Node | STAGE_UNSTAGE_VOLUME | NodeStageVolume | :heavy_check_mark: |
| | | | NodeUnstageVolume | :heavy_check_mark: |
| CephFS | Node | GET_VOLUME_STATS | NodeGetVolumeStats | :x: |
The easier part of the fix is the actual (dummy) implementation of the missing CSI methods in the Nomad client.
The harder part of the fix is how to implement a logical flow in e.g. the Job Stanza and the Volume Specification to service both the creation and mounting of NEW volumes through the CreateVolume API (like ceph-csi expects it) as well as the mounting of EXISTING volumes (like this Nomad example).
Of course I would be more than willing to give a shot at helping to implement the missing methods. Getting the flow right from HCL to the mounting of volumes in Nomad workloads ties more into the core of Nomad and I feel that should be left up to you guys to implement.
| Identity Method | Implemented? | Nomad Link |
|-------- |-------------- |------ |
| GetPluginInfo | :heavy_check_mark: | GetPluginInfo |
| GetPluginCapabilities | :heavy_check_mark: | GetPluginCapabilities |
| Probe | :heavy_check_mark: | Probe |

| Controller Method | Implemented? | Nomad Link | CSI Spec Link |
|-------- |-------------- |------ | ---------- |
| CreateVolume | :x: | | CreateVolume |
| DeleteVolume | :x: | | DeleteVolume |
| ControllerPublishVolume | :heavy_check_mark: | ControllerPublishVolume |
| ControllerUnpublishVolume | :heavy_check_mark: | ControllerUnpublishVolume |
| ValidateVolumeCapabilities | :heavy_check_mark: | ValidateVolumeCapabilities |
| ListVolumes | :x: | | ListVolumes |
| GetCapacity | :x: | | GetCapacity |
| ControllerGetCapabilities | :heavy_check_mark: | ControllerGetCapabilities |
| CreateSnapshot | :x: | | CreateSnapshot |
| DeleteSnapshot | :x: | | DeleteSnapshot |
| ListSnapshots | :x: | | ListSnapshots |
| ControllerExpandVolume | :x: | | ControllerExpandVolume |
| ControllerGetVolume | :x: | | ControllerGetVolume |

| Node Method | Implemented? | Nomad Link | CSI Spec Link |
|-------- |-------------- |------ | ------- |
| NodeStageVolume | :heavy_check_mark: | NodeStageVolume |
| NodeUnstageVolume | :heavy_check_mark: | NodeUnstageVolume |
| NodePublishVolume | :heavy_check_mark: | NodePublishVolume |
| NodeUnpublishVolume | :heavy_check_mark: | NodeUnpublishVolume |
| NodeGetVolumeStats | :x: | | NodeGetVolumeStats |
| NodeExpandVolume | :x: | | NodeExpandVolume |
| NodeGetCapabilities | :heavy_check_mark: | NodeGetCapabilities |
| NodeGetInfo | :heavy_check_mark: | NodeGetInfo |
Hi @sbouts, thanks for opening this issue! Yes, as you've noted, we didn't implement the volume creation workflow. We had this slated for a potential Phase 2 of work, but I don't think we realized this was a blocker for using plugins like Ceph.
As you've noted here, there's a good bit of work required to make this happen, so I'm going to tag my colleagues @galeep and @yishan-lin to make sure it's on their radar as some of our planning gets solidified.
This feature request on the Terraform Nomad provider is also relevant for this work.
https://github.com/terraform-providers/terraform-provider-nomad/issues/102
Thanks for the link @ryanmickler. That feature request is to cover the existing Nomad registration APIs, not the CSI APIs being discussed here.
Is there a separate issue to implement just the parameters block, rather than the entire volume creation spec? Currently, that's all that's limiting use of the ceph-csi plugin.
See here:
https://github.com/hashicorp/nomad/issues/7668#issuecomment-645878741
@ryanmickler reading that comment and then digging into https://github.com/container-storage-interface/spec/issues/387 it doesn't look like this is part of the CSI spec yet for the RPCs we support, is it?
From a quick inspection, it looks like NodeStageVolume (which mounts the volume to a staging path on the node)
https://github.com/ceph/ceph-csi/blob/47d5b60af8d48574ff6d11ca37dbff5a6f56815b/internal/rbd/nodeserver.go#L116
is calling genVolFromVolumeOptions on line 171
https://github.com/ceph/ceph-csi/blob/47d5b60af8d48574ff6d11ca37dbff5a6f56815b/internal/rbd/nodeserver.go#L171
Then inside that function, we are hitting our missing required parameter pool error here:
https://github.com/ceph/ceph-csi/blob/be9e7cf956c378227ff43e0194410468919766b7/internal/rbd/rbd_util.go#L694
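
The failure mode described above can be illustrated with a small sketch. This is not the ceph-csi source: `requireParam` and `stageVolume` are hypothetical helpers, and the sketch assumes the plugin reads `pool` from the string map that the orchestrator sends with the NodeStageVolume request (the `volume_context` field in the CSI spec). If the orchestrator does not forward the parameter in that map, the lookup fails even though the value is set in the volume configuration.

```go
package main

import "fmt"

// requireParam looks up a required key in the string map a plugin receives
// with a request. Hypothetical helper, not part of ceph-csi or Nomad.
func requireParam(params map[string]string, key string) (string, error) {
	v, ok := params[key]
	if !ok {
		return "", fmt.Errorf("missing required parameter %s", key)
	}
	return v, nil
}

// stageVolume stands in for the start of a NodeStageVolume handler: it can
// only see what the orchestrator put into volumeContext.
func stageVolume(volumeContext map[string]string) error {
	pool, err := requireParam(volumeContext, "pool")
	if err != nil {
		return err
	}
	fmt.Println("staging volume from pool", pool)
	return nil
}

func main() {
	// Case 1: the orchestrator forwards the parameters from the volume spec.
	if err := stageVolume(map[string]string{"pool": "example-pool"}); err != nil {
		fmt.Println("error:", err)
	}
	// Case 2: the parameters are configured but never reach the plugin.
	if err := stageVolume(map[string]string{}); err != nil {
		fmt.Println("error:", err)
	}
}
```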
Support for those fields was added in https://github.com/hashicorp/nomad/pull/7957
Right, perhaps those parameters aren't getting passed to NodeStageVolume properly, because pool = "<value>" is set in my config and I still get:
E0728 04:47:20.003465 1 utils.go:163] ID: 23 Req-ID: csi-test-0 GRPC error: rpc error: code = Internal desc = missing required parameter pool
Maybe I need to investigate further. Thanks for the help and your work on this!
@ryanmickler we should dig into this, but let's not clutter up this feature request issue with debugging that. Can you open a new issue and include your volume spec and any relevant logs?