Incubator-mxnet: Does MXNet support RDMA over Converged Ethernet (ROCE)

Created on 13 Apr 2017 · 7 Comments · Source: apache/incubator-mxnet

Does MXNet support RDMA over Converged Ethernet (RoCE) for communication between the parameter servers (ps) and the workers?

All 7 comments

Since MXNet uses the ZeroMQ library in ps-lite for inter-node communication, personally I don't think RDMA is supported yet. The only RDMA-related discussion on ZeroMQ dates back to 2011.

https://lists.zeromq.org/pipermail/zeromq-dev/2011-December/014502.html

In that reply, Gabriele pointed out some misconceptions about RDMA, RoCE and IB verbs, and briefly described his dissatisfaction with the performance of SDP (an alternative way to use RDMA-enabled network adapters).

http://zeromq.org/results:ib-tests-v206
http://mvapich.cse.ohio-state.edu/performance/pt_to_pt/

Comparing the performance numbers in the links above is unfair: both the hardware and the software environment on the zmq platform are outdated, while MVAPICH2 runs on the latest devices. However, the numbers still reveal the current status of RDMA in these two libraries:

  1. ~0.8 million msg/s, 888 MB/s throughput, 36 µs latency at 4 kB for zmq;
  2. ~2 million msg/s, 2 GB/s throughput, < 2 µs latency at 4 kB for MVAPICH2.
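
To put the gap in perspective, here is a small Python sketch that turns the figures quoted above into ratios. It uses only the numbers from the list; treating 2 GB/s as 2000 MB/s is my assumption, and since the MVAPICH2 latency is given only as an upper bound ("< 2 µs"), the latency ratio is a lower bound.

```python
# Figures copied from the list above. Assumption: 2 GB/s == 2000 MB/s (decimal units).
# The MVAPICH2 latency is quoted as "< 2 us", so 2.0 is an upper bound on it.
zmq = {"msg_per_s": 0.8e6, "throughput_mb_s": 888.0, "latency_us": 36.0}
mvapich2 = {"msg_per_s": 2.0e6, "throughput_mb_s": 2000.0, "latency_us": 2.0}


def ratios(slow, fast):
    """Return how many times better `fast` is than `slow` on each metric."""
    return {
        "msg_rate": fast["msg_per_s"] / slow["msg_per_s"],
        "throughput": fast["throughput_mb_s"] / slow["throughput_mb_s"],
        # Lower latency is better, so the ratio is inverted.
        "latency": slow["latency_us"] / fast["latency_us"],
    }


gap = ratios(zmq, mvapich2)
for name, value in gap.items():
    print(f"{name}: {value:.2f}x")
```

The message-rate and throughput gaps are small compared with the latency gap, which is where the two libraries differ most in these (admittedly mismatched) measurements.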

But better performance does NOT necessarily mean a better choice. I think the MXNet team made tradeoffs among many aspects. I would appreciate it if they could share their experience with this design choice.

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!

Hi all, TensorFlow GDR developer here (https://github.com/tensorflow/tensorflow/pull/11392). We are actively working on integrating RDMA verbs natively with MXNet/pslite. We will send the PR when it is ready; in the meantime, I am wondering whether the MXNet community has any major concerns about accepting such a feature.

@byronyi Cool. I've got a hobby project which may bring RDMA to MXNet/pslite too. Regarding the GDR approach to RDMA-enabling MXNet, does it require a "Mellanox InfiniBand + NVIDIA GPU" platform? I wonder whether an x86 platform, such as Intel CPUs or Xeon Phi manycore processors with Omni-Path networks, would benefit from it?

@weijianwen We are primarily targeting RoCEv2 deployments, but in principle it should require no modification to support RoCEv1, InfiniBand and iWARP. We do not plan to support GPU/Xeon Phi Direct RDMA at this stage, as it seems that inter-node communication, i.e. pslite, is largely agnostic to the actual worker processors. It might also require significant refactoring or redesign on MXNet's side, and I do hope the MXNet developers can shed some light on this direction.

I know little about Omni-Path, but I have heard that Intel does support the same verbs interface as defined in the RDMA specification (RFC 5040).

@byronyi It sounds like you're replacing the TCP/IP semantics on which pslite (or, more specifically, ZeroMQ) relies with RoCE semantics. We made a similar observation: pslite is agnostic to the use case (CPUs, GPUs or multi-node reduction), so it is rather straightforward to adapt pslite to an RDMA-enabled fabric. However, rewriting only pslite, without reconsidering the data communication pattern as GDR or NCCL do, may lose opportunities for further optimization, given that upper-layer information is absent.

I think there are some questions worth considering when designing an RDMA-enabled MXNet, and I hope to get some insights from the MXNet community and byronyi.

  1. Shall we build MXNet to be agnostic to network fabrics (Ethernet, RoCE, InfiniBand), or build versions tailored to specific fabrics?
  2. Which approach is favored? 1) Add an RDMA plugin to ZeroMQ, like what @byronyi did for gRPC and TensorFlow; or 2) simply replace ZeroMQ with RoCE message-passing semantics.
  3. Which parts of MXNet need a redesign when porting to RDMA-enabled networks? (Personally I think Van in pslite is heavily influenced by ZeroMQ's APIs, and it sometimes looks weird to me.)
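
To make questions 1 and 2 a bit more concrete, here is a minimal Python sketch of what a fabric-agnostic layer with pluggable backends could look like. Everything in it is hypothetical and for illustration only: `Transport`, `register_transport`, `create_transport`, `LoopbackTransport` and `RdmaTransport` are names I made up, not the pslite `Van` API and not what any PR implements. The loopback backend is an in-process stand-in so the sketch runs without any special hardware.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class Transport(ABC):
    """Hypothetical fabric-agnostic interface (NOT the pslite Van API)."""

    @abstractmethod
    def send(self, node_id: int, payload: bytes) -> int:
        """Send payload to node_id and return the number of bytes sent."""

    @abstractmethod
    def recv(self) -> Tuple[int, bytes]:
        """Return the next (node_id, payload) pair."""


# Maps a backend name to a factory; upper layers only ever see the name.
_REGISTRY: Dict[str, Callable[[], Transport]] = {}


def register_transport(name: str):
    """Class decorator that registers a backend under the given name."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def create_transport(name: str) -> Transport:
    """Instantiate a registered backend, or raise ValueError if unknown."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown transport {name!r}; available: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register_transport("loopback")
class LoopbackTransport(Transport):
    """In-process stand-in for a real fabric; messages go into a local queue."""

    def __init__(self) -> None:
        self._queue: Deque[Tuple[int, bytes]] = deque()

    def send(self, node_id: int, payload: bytes) -> int:
        self._queue.append((node_id, payload))
        return len(payload)

    def recv(self) -> Tuple[int, bytes]:
        # Returns the pair exactly as it was queued by send().
        return self._queue.popleft()


@register_transport("rdma")
class RdmaTransport(Transport):
    """Placeholder only: a real backend would sit on top of a verbs library."""

    def send(self, node_id: int, payload: bytes) -> int:
        raise NotImplementedError("hypothetical RDMA backend, not implemented")

    def recv(self) -> Tuple[int, bytes]:
        raise NotImplementedError("hypothetical RDMA backend, not implemented")


transport = create_transport("loopback")
transport.send(3, b"grad")
print(transport.recv())
```

With a layout like this, the choice between a plugin and a wholesale replacement comes down to whether the existing ZeroMQ path stays registered as one backend among several or is removed. It does not answer the concern that a generic interface hides upper-layer information that a fabric-specific design could exploit.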

Hi all, please see https://github.com/dmlc/ps-lite/pull/124 for our PR. Many thanks to my colleagues @crazyboycjr and @snowzjx for their design and implementation.
