Horovod: How to run unit tests

Created on 3 May 2019 · 3 Comments · Source: horovod/horovod

Could you advise me on how to properly run the unit tests in horovod/test/?

For now I just use something like python -m unittest -v test_torch.py
Is this the correct way?
How do I specify GPU/CPU, number of workers, etc.?

question

Most helpful comment

Horovod has unit tests for all frameworks. You can run them from the test
directory:

cd /horovod/test && (echo test_*.py | xargs -n 1 env JAVA_HOME=/Library/Java/JavaVirtualMachines/openjdk-10.0.2.jdk/Contents/Home mpirun -np 2 pytest -v --capture=no)

Note: You will need the latest PySpark and Java 10 on OS X to run the Spark
tests.
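
On a single machine, the same per-file pattern could be scripted. The sketch below is only an illustration and not part of Horovod: `build_test_commands` and its parameters are hypothetical names. It relies on `mpirun -np` setting the number of worker processes, and it assumes that an empty `CUDA_VISIBLE_DEVICES` hides the GPUs so that tests which check `torch.cuda.is_available()` take the CPU path.

```python
import glob
import os
import shlex


def build_test_commands(test_dir=".", num_procs=2, hide_gpus=False):
    """Return (command, env) pairs, one per test_*.py file in test_dir.

    Hypothetical helper for illustration; it only builds the commands.
    """
    env = dict(os.environ)
    if hide_gpus:
        # An empty CUDA_VISIBLE_DEVICES makes no GPU visible to CUDA programs.
        env["CUDA_VISIBLE_DEVICES"] = ""
    commands = []
    for path in sorted(glob.glob(os.path.join(test_dir, "test_*.py"))):
        # One mpirun invocation per test file, as in the xargs one-liner.
        cmd = ["mpirun", "-np", str(num_procs),
               "pytest", "-v", "--capture=no", os.path.basename(path)]
        commands.append((cmd, env))
    return commands


if __name__ == "__main__":
    for cmd, _env in build_test_commands(num_procs=2):
        # Print each command; to run it, pass cmd and env to
        # subprocess.run(cmd, cwd=test_dir, env=env).
        print(" ".join(shlex.quote(part) for part in cmd))
```

Whether the launcher forwards the modified environment to every worker depends on the MPI implementation and on whether all ranks run on the local host, so treat the GPU-hiding part as an assumption to verify.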

I'm only interested in PyTorch tests now, so I run like this: mpirun -np 2 pytest -v --capture=no test_torch.py

There are some failures and it hangs on test_horovod_allreduce_type_error:

~/repos/clean/horovod/test$ mpirun -np 2 pytest -v --capture=no test_torch.py
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/sergii/anaconda3/envs/horovod-clean/bin/python
cachedir: .pytest_cache
rootdir: /home/sergii/repos/clean/horovod
collecting ... ============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/sergii/anaconda3/envs/horovod-clean/bin/python
cachedir: .pytest_cache
rootdir: /home/sergii/repos/clean/horovod
collected 33 items

collected 33 items

test_torch.py::TorchTests::test_broadcast_state PASSEDPASSED
test_torch.py::TorchTests::test_broadcast_state_options
test_torch.py::TorchTests::test_broadcast_state_options PASSEDPASSED
test_torch.py::TorchTests::test_compression_fp16
test_torch.py::TorchTests::test_compression_fp16 PASSEDPASSED
test_torch.py::TorchTests::test_duplicate_names
test_torch.py::TorchTests::test_duplicate_names PASSEDPASSED
test_torch.py::TorchTests::test_dynamic_requires_grad
test_torch.py::TorchTests::test_dynamic_requires_grad PASSEDPASSED
test_torch.py::TorchTests::test_force_allreduce
test_torch.py::TorchTests::test_force_allreduce PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather
test_torch.py::TorchTests::test_horovod_allgather PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_error
test_torch.py::TorchTests::test_horovod_allgather_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_grad
test_torch.py::TorchTests::test_horovod_allgather_grad PASSED
test_torch.py::TorchTests::test_horovod_allgather_type_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_type_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_variable_size
test_torch.py::TorchTests::test_horovod_allgather_variable_size PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce
test_torch.py::TorchTests::test_horovod_allreduce PASSEDPASSED

test_torch.py::TorchTests::test_horovod_allreduce_async_fused test_torch.py::TorchTests::test_horovod_allreduce_async_fused PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_average
test_torch.py::TorchTests::test_horovod_allreduce_average PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error FAILED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error FAILED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad
test_torch.py::TorchTests::test_horovod_allreduce_grad PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad_average
test_torch.py::TorchTests::test_horovod_allreduce_grad_average PASSED
test_torch.py::TorchTests::test_horovod_allreduce_inplace PASSED
test_torch.py::TorchTests::test_horovod_allreduce_inplace PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_multi_gpu
test_torch.py::TorchTests::test_horovod_allreduce_multi_gpu FAILED
test_torch.py::TorchTests::test_horovod_allreduce_type_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast FAILED
test_torch.py::TorchTests::test_horovod_allreduce_type_error [2019-05-03 16:45:18.779126: W horovod/common/operations.cc:764] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
0: [broadcast.noname.1352]
1: [allreduce.noname.1352]
[2019-05-03 16:46:18.784121: W horovod/common/operations.cc:764] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
0: [broadcast.noname.1352]
1: [allreduce.noname.1352]
[2019-05-03 16:47:18.785487: W horovod/common/operations.cc:764] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
0: [broadcast.noname.1352]
1: [allreduce.noname.1352]

If I run without mpirun, there are no hangs, and a different test fails:

$ pytest -v --capture=no test_torch.py
============================================================================================================ test session starts ============================================================================================================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/sergii/anaconda3/envs/horovod-clean/bin/python
cachedir: .pytest_cache
rootdir: /home/sergii/repos/clean/horovod
collected 33 items

test_torch.py::TorchTests::test_broadcast_state PASSED
test_torch.py::TorchTests::test_broadcast_state_options PASSED
test_torch.py::TorchTests::test_compression_fp16 PASSED
test_torch.py::TorchTests::test_duplicate_names PASSED
test_torch.py::TorchTests::test_dynamic_requires_grad PASSED
test_torch.py::TorchTests::test_force_allreduce PASSED
test_torch.py::TorchTests::test_horovod_allgather PASSED
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_grad FAILED
test_torch.py::TorchTests::test_horovod_allgather_type_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_variable_size PASSED
test_torch.py::TorchTests::test_horovod_allreduce PASSED
test_torch.py::TorchTests::test_horovod_allreduce_async_fused PASSED
test_torch.py::TorchTests::test_horovod_allreduce_average PASSED
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad PASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad_average PASSED
test_torch.py::TorchTests::test_horovod_allreduce_inplace PASSED
test_torch.py::TorchTests::test_horovod_allreduce_multi_gpu PASSED
test_torch.py::TorchTests::test_horovod_allreduce_type_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast PASSED
test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_grad PASSED
test_torch.py::TorchTests::test_horovod_broadcast_inplace PASSED
test_torch.py::TorchTests::test_horovod_broadcast_rank_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_type_error PASSED
test_torch.py::TorchTests::test_horovod_rank PASSED
test_torch.py::TorchTests::test_horovod_size PASSED
test_torch.py::TorchTests::test_model_parallelism PASSED

================================================================================================================= FAILURES ==================================================================================================================
__________________________________________________________________________________________________ TorchTests.test_horovod_allgather_grad ___________________________________________________________________________________________________

self = <test_torch.TorchTests testMethod=test_horovod_allgather_grad>

    def test_horovod_allgather_grad(self):
        """Test the correctness of the allgather gradient."""
        hvd.init()
        rank = hvd.rank()
        size = hvd.size()

        # Only Tensors of floating point dtype can require gradients
        dtypes = [torch.FloatTensor, torch.DoubleTensor]
        if torch.cuda.is_available():
            dtypes += [torch.cuda.FloatTensor, torch.cuda.DoubleTensor]
            if _fp16_supported:
                dtypes += [torch.cuda.HalfTensor]
        dims = [1, 2, 3]
        for dtype, dim in itertools.product(dtypes, dims):
            # Support tests up to MPI Size of 35
            if size > 35:
                break

            tensor_sizes = [3, 2, 7, 4, 6, 8, 10] * 5
            tensor_sizes = tensor_sizes[:size]

            tensor = torch.FloatTensor(
                *([tensor_sizes[rank]] + [17] * (dim - 1))).fill_(1).mul_(rank)
            tensor = self.cast_and_place(tensor, dtype)
            tensor.requires_grad_()

            grad_list = []
            for r, size in enumerate(tensor_sizes):
                grad_list.append(self.cast_and_place(
                    torch.ones([size] + [17] * (dim - 1)), dtype) * r)
            grad_ys = torch.cat(grad_list, dim=0)

            gathered = hvd.allgather(tensor)
>           gathered.backward(grad_ys)

test_torch.py:612:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../anaconda3/envs/horovod-clean/lib/python3.7/site-packages/torch/tensor.py:107: in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tensors = (tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0....    [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
       grad_fn=<HorovodAllgatherBackward>),)
grad_tensors = (tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0...., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]),)
retain_graph = False, create_graph = False, grad_variables = None

    def backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None):
        r"""Computes the sum of gradients of given tensors w.r.t. graph leaves.

        The graph is differentiated using the chain rule. If any of ``tensors``
        are non-scalar (i.e. their data has more than one element) and require
        gradient, then the Jacobian-vector product would be computed, in this
        case the function additionally requires specifying ``grad_tensors``.
        It should be a sequence of matching length, that contains the "vector"
        in the Jacobian-vector product, usually the gradient of the differentiated
        function w.r.t. corresponding tensors (``None`` is an acceptable value for
        all tensors that don't need gradient tensors).

        This function accumulates gradients in the leaves - you might need to zero
        them before calling it.

        Arguments:
            tensors (sequence of Tensor): Tensors of which the derivative will be
                computed.
            grad_tensors (sequence of (Tensor or None)): The "vector" in the Jacobian-vector
                product, usually gradients w.r.t. each element of corresponding tensors.
                None values can be specified for scalar Tensors or ones that don't require
                grad. If a None value would be acceptable for all grad_tensors, then this
                argument is optional.
            retain_graph (bool, optional): If ``False``, the graph used to compute the grad
                will be freed. Note that in nearly all cases setting this option to ``True``
                is not needed and often can be worked around in a much more efficient
                way. Defaults to the value of ``create_graph``.
            create_graph (bool, optional): If ``True``, graph of the derivative will
                be constructed, allowing to compute higher order derivative products.
                Defaults to ``False``.
        """
        if grad_variables is not None:
            warnings.warn("'grad_variables' is deprecated. Use 'grad_tensors' instead.")
            if grad_tensors is None:
                grad_tensors = grad_variables
            else:
                raise RuntimeError("'grad_tensors' and 'grad_variables' (deprecated) "
                                   "arguments both passed to backward(). Please only "
                                   "use 'grad_tensors'.")

        tensors = (tensors,) if isinstance(tensors, torch.Tensor) else tuple(tensors)

        if grad_tensors is None:
            grad_tensors = [None] * len(tensors)
        elif isinstance(grad_tensors, torch.Tensor):
            grad_tensors = [grad_tensors]
        else:
            grad_tensors = list(grad_tensors)

        grad_tensors = _make_grads(tensors, grad_tensors)
        if retain_graph is None:
            retain_graph = create_graph

        Variable._execution_engine.run_backward(
            tensors, grad_tensors, retain_graph, create_graph,
>           allow_unreachable=True)  # allow_unreachable flag
E       RuntimeError: invalid gradient at index 0 - got [12, 17] but expected shape compatible with [3, 17]

../../../../anaconda3/envs/horovod-clean/lib/python3.7/site-packages/torch/autograd/__init__.py:93: RuntimeError
============================================================================================================= warnings summary ==============================================================================================================
test/test_torch.py::TorchTests::test_broadcast_state
  /home/sergii/repos/clean/horovod/test/test_torch.py:822: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
    if k in inspect.getargspec(cls.__init__).args

test/test_torch.py::TorchTests::test_broadcast_state
test/test_torch.py::TorchTests::test_broadcast_state_options
  /home/sergii/anaconda3/envs/horovod-clean/lib/python3.7/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
    warnings.warn(warning.format(ret))

test/test_torch.py::TorchTests::test_broadcast_state
  /home/sergii/anaconda3/envs/horovod-clean/lib/python3.7/site-packages/horovod/torch/__init__.py:279: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    if isinstance(x, collections.Iterable):

test/test_torch.py::TorchTests::test_broadcast_state_options
  /home/sergii/repos/clean/horovod/test/test_torch.py:968: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
    if k in inspect.getargspec(opt_class.__init__).args

-- Docs: https://docs.pytest.org/en/latest/warnings.html
============================================================================================== 1 failed, 32 passed, 5 warnings in 9.24 seconds ==============================================================================================

Do you know what's going on?
This is on an Ubuntu Linux machine with 2 (slightly different) GPUs, PyTorch 1.1, and Horovod from GitHub master.

I've found the reason for most of the failed tests, and the hang was caused by the bad state left behind by a previous test failure. I'll create separate issues for the individual test problems.
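
One plausible reading of the "Stalled ranks" warning in the log above is that, after a failure, the two ranks fell out of step: rank 0 was waiting on a broadcast while rank 1 was still waiting on an allreduce. The toy model below illustrates that situation. It is not Horovod's coordination logic; `first_mismatch` and the op names are made up for the example.

```python
def first_mismatch(ops_by_rank):
    """Return (step, {rank: op}) for the first step where ranks disagree.

    ops_by_rank holds one list per rank with the collective op names that
    rank submits, in order. Returns None if the ranks agree at every step
    they have in common.
    """
    for step, ops in enumerate(zip(*ops_by_rank)):
        if len(set(ops)) > 1:
            # Ranks submitted different collectives: neither can complete.
            return step, dict(enumerate(ops))
    return None


# Hypothetical sequences: rank 0's test failed before its second allreduce,
# so it moved on to the next test's broadcast while rank 1 still waits.
rank0 = ["allreduce.t1", "broadcast.t3"]
rank1 = ["allreduce.t1", "allreduce.t2", "broadcast.t3"]

print(first_mismatch([rank0, rank1]))
# prints (1, {0: 'broadcast.t3', 1: 'allreduce.t2'})
```

In this model a collective only completes when every rank submits the same op at the same step, which is why a single skipped call on one rank is enough to stall everything after it.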
