Horovod: Allgather vs Allreduce

Created on 5 Dec 2018 · 3 Comments · Source: horovod/horovod

I am aware that allgather has performance benefits over allreduce when aggregating sparse gradients (IndexedSlices in TensorFlow), and that allreduce beats allgather for dense ones. What is the reason for this?

question


All 3 comments

Hey @UditGupta10, it comes down to the way the data is represented.

Suppose you have a tensor with shape (100, 100), but only 10 of its 10,000 elements (0.1%) are non-zero. It's up to you whether you represent this as a dense tensor (array) or a sparse tensor (dictionary of keys and values).
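As a rough illustration of the two representations, here is a minimal NumPy sketch; the shape matches the example above, but the indices and values are made up:

```python
import numpy as np

# Dense representation: every element is stored, zeros included.
dense = np.zeros((100, 100), dtype=np.float32)
dense[3, 7] = 1.5
dense[42, 99] = -0.25
print(dense.size)        # 10000 stored values

# Sparse representation: only the non-zero entries are stored,
# as parallel lists of keys (indices) and values.
keys = np.array([[3, 7], [42, 99]], dtype=np.int64)
values = np.array([1.5, -0.25], dtype=np.float32)
print(keys.shape[0])     # 2 stored entries
```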

You cannot allreduce a sparse tensor directly. The result would simply not make sense, because a sparse tensor consists of two lists that need to be communicated: the keys and the values. Summing the keys is meaningless, and without first bringing all the keys together we don't know which values should be summed with which other values.
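A tiny sketch of the problem, with two hypothetical workers holding different keys (the keys and values here are invented for illustration):

```python
import numpy as np

# Hypothetical sparse gradients on two workers, each as (keys, values).
worker_a = (np.array([3, 17]), np.array([1.0, 2.0]))
worker_b = (np.array([5, 17]), np.array([4.0, 8.0]))

# Naively summing the key arrays elementwise (what an allreduce would do)
# produces indices that exist on neither worker, so the result is meaningless.
print(worker_a[0] + worker_b[0])   # [ 8 34]

# The correct sum has to match values by key, which requires first seeing
# all keys from all workers (i.e., gathering them).
combined = {}
for keys, values in (worker_a, worker_b):
    for k, v in zip(keys, values):
        combined[int(k)] = combined.get(int(k), 0.0) + float(v)
print(combined)                    # {3: 1.0, 17: 10.0, 5: 4.0}
```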

One way to work around this is to convert the sparse tensor into a dense tensor and allreduce it. For the example tensor, that means going from roughly 20 stored numbers (10 keys plus 10 values) to 10,000, a several-hundred-fold increase in the memory footprint of our tensor. That's a lot more expensive to send over the network. But if our sparse tensor were 99% non-zero, the conversion would cost very little, which is why the Horovod API provides an option to convert sparse tensors to dense tensors during allreduce.
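For reference, Horovod's DistributedOptimizer for TensorFlow exposes this densification via the sparse_as_dense flag. A minimal sketch, assuming the TF 1.x-style API that was current when this thread was written (the optimizer choice and learning rate are arbitrary):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

opt = tf.train.AdamOptimizer(0.001 * hvd.size())

# sparse_as_dense=True converts IndexedSlices gradients to dense tensors
# so they are allreduced instead of allgathered in sparse form.
opt = hvd.DistributedOptimizer(opt, sparse_as_dense=True)
```

The tradeoff is exactly the one described above: densification pays off when the gradient is mostly non-zero, while allgather pays off when it is mostly zeros.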

The alternative is to keep the tensors sparse, but to allgather them instead of allreducing. Suppose you have 10 workers in your ring. Then allgathering a 10-element sparse tensor across all 10 workers results in only 100 keys and 100 values, about 2% of the memory footprint of the 10,000-element dense tensor.
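The back-of-the-envelope arithmetic for that comparison (element counts only, ignoring dtypes and index width):

```python
workers = 10
dense_elements = 100 * 100            # one (100, 100) dense tensor
nonzero_per_worker = 10               # non-zero entries per worker

# Allgather concatenates every worker's keys and values.
gathered = workers * nonzero_per_worker * 2   # keys + values
print(gathered, dense_elements, gathered / dense_elements)   # 200 10000 0.02
```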

So that's why we generally use allgather for sparse tensors. Does that all make sense?

Thanks for the detailed explanation.

@UditGupta10, we've noticed that some TensorFlow models that use embeddings have issues (#554, #566, #430). Until the underlying issue is fixed, the feature in #570 may help alleviate the problem.

