Could you advise me on how to properly run the unit tests in horovod/test/?
For now I just use something like python -m unittest -v test_torch.py
Is this the correct way?
How do I specify GPU/CPU, number of workers, etc?
Horovod has unit tests for all frameworks, which you can run from the test
directory:
cd /horovod/test && (echo test_*.py | xargs -n 1 \
env JAVA_HOME=/Library/Java/JavaVirtualMachines/openjdk-10.0.2.jdk/Contents/Home \
mpirun -np 2 pytest -v --capture=no)
Note: You will need the latest PySpark and Java 10 on OS X to run the Spark
tests.
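For anyone who prefers to drive this from Python, here is a minimal sketch that builds the same per-file mpirun invocation shown above. The function name build_test_commands and its parameters are made up for illustration and are not part of Horovod; only the mpirun and pytest flags come from the command in this reply.

```python
def build_test_commands(test_files, num_procs=2):
    """Return one argv list per test file, mirroring
    `mpirun -np N pytest -v --capture=no <file>` from the reply above.

    `test_files` and `num_procs` are illustrative parameters, not Horovod API.
    """
    commands = []
    for name in sorted(test_files):
        commands.append(
            ["mpirun", "-np", str(num_procs),
             "pytest", "-v", "--capture=no", name]
        )
    return commands


# Example: only the PyTorch tests, with two worker processes.
torch_only = build_test_commands(["test_torch.py"], num_procs=2)
```

Each list can be passed to subprocess.run with cwd set to the test directory. The number of workers is just the -np value. For the GPU/CPU part of the question, setting CUDA_VISIBLE_DEVICES to an empty string in the environment is a common way to hide GPUs from CUDA programs, but whether that is the intended way to force CPU for these tests is not stated in this thread.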
(In reply to Sergii Dymchenko's question of Thu, May 2, 2019, 6:33 PM.)
I'm only interested in the PyTorch tests for now, so I run them like this: mpirun -np 2 pytest -v --capture=no test_torch.py
There are some failures and it hangs on test_horovod_allreduce_type_error:
~/repos/clean/horovod/test$ mpirun -np 2 pytest -v --capture=no test_torch.py
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/sergii/anaconda3/envs/horovod-clean/bin/python
cachedir: .pytest_cache
rootdir: /home/sergii/repos/clean/horovod
collecting ... ============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/sergii/anaconda3/envs/horovod-clean/bin/python
cachedir: .pytest_cache
rootdir: /home/sergii/repos/clean/horovod
collected 33 items
collected 33 items
test_torch.py::TorchTests::test_broadcast_state PASSEDPASSED
test_torch.py::TorchTests::test_broadcast_state_options
test_torch.py::TorchTests::test_broadcast_state_options PASSEDPASSED
test_torch.py::TorchTests::test_compression_fp16
test_torch.py::TorchTests::test_compression_fp16 PASSEDPASSED
test_torch.py::TorchTests::test_duplicate_names
test_torch.py::TorchTests::test_duplicate_names PASSEDPASSED
test_torch.py::TorchTests::test_dynamic_requires_grad
test_torch.py::TorchTests::test_dynamic_requires_grad PASSEDPASSED
test_torch.py::TorchTests::test_force_allreduce
test_torch.py::TorchTests::test_force_allreduce PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather
test_torch.py::TorchTests::test_horovod_allgather PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_error
test_torch.py::TorchTests::test_horovod_allgather_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_grad
test_torch.py::TorchTests::test_horovod_allgather_grad PASSED
test_torch.py::TorchTests::test_horovod_allgather_type_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_type_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_variable_size
test_torch.py::TorchTests::test_horovod_allgather_variable_size PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce
test_torch.py::TorchTests::test_horovod_allreduce PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_async_fused test_torch.py::TorchTests::test_horovod_allreduce_async_fused PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_average
test_torch.py::TorchTests::test_horovod_allreduce_average PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error FAILED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error FAILED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad
test_torch.py::TorchTests::test_horovod_allreduce_grad PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad_average
test_torch.py::TorchTests::test_horovod_allreduce_grad_average PASSED
test_torch.py::TorchTests::test_horovod_allreduce_inplace PASSED
test_torch.py::TorchTests::test_horovod_allreduce_inplace PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allreduce_multi_gpu
test_torch.py::TorchTests::test_horovod_allreduce_multi_gpu FAILED
test_torch.py::TorchTests::test_horovod_allreduce_type_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast FAILED
test_torch.py::TorchTests::test_horovod_allreduce_type_error [2019-05-03 16:45:18.779126: W horovod/common/operations.cc:764] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
0: [broadcast.noname.1352]
1: [allreduce.noname.1352]
[2019-05-03 16:46:18.784121: W horovod/common/operations.cc:764] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
0: [broadcast.noname.1352]
1: [allreduce.noname.1352]
[2019-05-03 16:47:18.785487: W horovod/common/operations.cc:764] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
0: [broadcast.noname.1352]
1: [allreduce.noname.1352]
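The stall warning above lists rank 0 waiting in a broadcast while rank 1 waits in an allreduce, i.e. the two ranks are no longer submitting the same collective at the same step. A toy sketch of that condition follows. It uses plain Python lists rather than Horovod, and the two traces are hypothetical, chosen only to mirror the "Stalled ranks" lines.

```python
from itertools import zip_longest


def first_divergence(ops_by_rank):
    """Return (step, ops) for the first step where ranks disagree, else None.

    `ops_by_rank` holds one list of collective names per rank, in submission
    order. A rank that has run out of operations is represented by None.
    """
    for step, ops in enumerate(zip_longest(*ops_by_rank, fillvalue=None)):
        if len(set(ops)) > 1:
            return step, ops
    return None


# Hypothetical traces: rank 0 has moved on to a broadcast test while
# rank 1 is still inside an allreduce test.
rank0 = ["allreduce", "broadcast"]
rank1 = ["allreduce", "allreduce"]
divergence = first_divergence([rank0, rank1])
```

Here divergence is (1, ('broadcast', 'allreduce')): at step 1 each rank is blocked waiting for the other to join a collective that it will never submit.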
If I run without mpirun, nothing hangs and a single, different test fails:
$ pytest -v --capture=no test_torch.py
============================================================================================================ test session starts ============================================================================================================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /home/sergii/anaconda3/envs/horovod-clean/bin/python
cachedir: .pytest_cache
rootdir: /home/sergii/repos/clean/horovod
collected 33 items
test_torch.py::TorchTests::test_broadcast_state PASSED
test_torch.py::TorchTests::test_broadcast_state_options PASSED
test_torch.py::TorchTests::test_compression_fp16 PASSED
test_torch.py::TorchTests::test_duplicate_names PASSED
test_torch.py::TorchTests::test_dynamic_requires_grad PASSED
test_torch.py::TorchTests::test_force_allreduce PASSED
test_torch.py::TorchTests::test_horovod_allgather PASSED
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_grad FAILED
test_torch.py::TorchTests::test_horovod_allgather_type_error PASSED
test_torch.py::TorchTests::test_horovod_allgather_variable_size PASSED
test_torch.py::TorchTests::test_horovod_allreduce PASSED
test_torch.py::TorchTests::test_horovod_allreduce_async_fused PASSED
test_torch.py::TorchTests::test_horovod_allreduce_average PASSED
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad PASSED
test_torch.py::TorchTests::test_horovod_allreduce_grad_average PASSED
test_torch.py::TorchTests::test_horovod_allreduce_inplace PASSED
test_torch.py::TorchTests::test_horovod_allreduce_multi_gpu PASSED
test_torch.py::TorchTests::test_horovod_allreduce_type_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast PASSED
test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_grad PASSED
test_torch.py::TorchTests::test_horovod_broadcast_inplace PASSED
test_torch.py::TorchTests::test_horovod_broadcast_rank_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_type_error PASSED
test_torch.py::TorchTests::test_horovod_rank PASSED
test_torch.py::TorchTests::test_horovod_size PASSED
test_torch.py::TorchTests::test_model_parallelism PASSED
================================================================================================================= FAILURES ==================================================================================================================
__________________________________________________________________________________________________ TorchTests.test_horovod_allgather_grad ___________________________________________________________________________________________________
self = <test_torch.TorchTests testMethod=test_horovod_allgather_grad>
    def test_horovod_allgather_grad(self):
        """Test the correctness of the allgather gradient."""
        hvd.init()
        rank = hvd.rank()
        size = hvd.size()
        # Only Tensors of floating point dtype can require gradients
        dtypes = [torch.FloatTensor, torch.DoubleTensor]
        if torch.cuda.is_available():
            dtypes += [torch.cuda.FloatTensor, torch.cuda.DoubleTensor]
            if _fp16_supported:
                dtypes += [torch.cuda.HalfTensor]
        dims = [1, 2, 3]
        for dtype, dim in itertools.product(dtypes, dims):
            # Support tests up to MPI Size of 35
            if size > 35:
                break
            tensor_sizes = [3, 2, 7, 4, 6, 8, 10] * 5
            tensor_sizes = tensor_sizes[:size]
            tensor = torch.FloatTensor(
                *([tensor_sizes[rank]] + [17] * (dim - 1))).fill_(1).mul_(rank)
            tensor = self.cast_and_place(tensor, dtype)
            tensor.requires_grad_()
            grad_list = []
            for r, size in enumerate(tensor_sizes):
                grad_list.append(self.cast_and_place(
                    torch.ones([size] + [17] * (dim - 1)), dtype) * r)
            grad_ys = torch.cat(grad_list, dim=0)
            gathered = hvd.allgather(tensor)
>           gathered.backward(grad_ys)
test_torch.py:612:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../anaconda3/envs/horovod-clean/lib/python3.7/site-packages/torch/tensor.py:107: in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tensors = (tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.... [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
grad_fn=<HorovodAllgatherBackward>),)
grad_tensors = (tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0...., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]),)
retain_graph = False, create_graph = False, grad_variables = None
    def backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None):
        r"""Computes the sum of gradients of given tensors w.r.t. graph leaves.
        The graph is differentiated using the chain rule. If any of ``tensors``
        are non-scalar (i.e. their data has more than one element) and require
        gradient, then the Jacobian-vector product would be computed, in this
        case the function additionally requires specifying ``grad_tensors``.
        It should be a sequence of matching length, that contains the "vector"
        in the Jacobian-vector product, usually the gradient of the differentiated
        function w.r.t. corresponding tensors (``None`` is an acceptable value for
        all tensors that don't need gradient tensors).
        This function accumulates gradients in the leaves - you might need to zero
        them before calling it.
        Arguments:
            tensors (sequence of Tensor): Tensors of which the derivative will be
                computed.
            grad_tensors (sequence of (Tensor or None)): The "vector" in the Jacobian-vector
                product, usually gradients w.r.t. each element of corresponding tensors.
                None values can be specified for scalar Tensors or ones that don't require
                grad. If a None value would be acceptable for all grad_tensors, then this
                argument is optional.
            retain_graph (bool, optional): If ``False``, the graph used to compute the grad
                will be freed. Note that in nearly all cases setting this option to ``True``
                is not needed and often can be worked around in a much more efficient
                way. Defaults to the value of ``create_graph``.
            create_graph (bool, optional): If ``True``, graph of the derivative will
                be constructed, allowing to compute higher order derivative products.
                Defaults to ``False``.
        """
        if grad_variables is not None:
            warnings.warn("'grad_variables' is deprecated. Use 'grad_tensors' instead.")
            if grad_tensors is None:
                grad_tensors = grad_variables
            else:
                raise RuntimeError("'grad_tensors' and 'grad_variables' (deprecated) "
                                   "arguments both passed to backward(). Please only "
                                   "use 'grad_tensors'.")
        tensors = (tensors,) if isinstance(tensors, torch.Tensor) else tuple(tensors)
        if grad_tensors is None:
            grad_tensors = [None] * len(tensors)
        elif isinstance(grad_tensors, torch.Tensor):
            grad_tensors = [grad_tensors]
        else:
            grad_tensors = list(grad_tensors)
        grad_tensors = _make_grads(tensors, grad_tensors)
        if retain_graph is None:
            retain_graph = create_graph
        Variable._execution_engine.run_backward(
            tensors, grad_tensors, retain_graph, create_graph,
>           allow_unreachable=True)  # allow_unreachable flag
E       RuntimeError: invalid gradient at index 0 - got [12, 17] but expected shape compatible with [3, 17]
../../../../anaconda3/envs/horovod-clean/lib/python3.7/site-packages/torch/autograd/__init__.py:93: RuntimeError
============================================================================================================= warnings summary ==============================================================================================================
test/test_torch.py::TorchTests::test_broadcast_state
/home/sergii/repos/clean/horovod/test/test_torch.py:822: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
if k in inspect.getargspec(cls.__init__).args
test/test_torch.py::TorchTests::test_broadcast_state
test/test_torch.py::TorchTests::test_broadcast_state_options
/home/sergii/anaconda3/envs/horovod-clean/lib/python3.7/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
test/test_torch.py::TorchTests::test_broadcast_state
/home/sergii/anaconda3/envs/horovod-clean/lib/python3.7/site-packages/horovod/torch/__init__.py:279: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
if isinstance(x, collections.Iterable):
test/test_torch.py::TorchTests::test_broadcast_state_options
/home/sergii/repos/clean/horovod/test/test_torch.py:968: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
if k in inspect.getargspec(opt_class.__init__).args
-- Docs: https://docs.pytest.org/en/latest/warnings.html
============================================================================================== 1 failed, 32 passed, 5 warnings in 9.24 seconds ==============================================================================================
Do you know what's going on?
This is on an Ubuntu Linux machine with 2 (slightly different) GPUs, PyTorch 1.1, and Horovod from GitHub master.
I've found the reason for most of the failed tests, and the hang was caused by bad state left over from a previous test failure. I'll create separate issues for the individual test problems.
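For the single-process failure in test_horovod_allgather_grad, one reading that fits the traceback (not confirmed anywhere in this thread) is that the inner loop "for r, size in enumerate(tensor_sizes)" rebinds size, so the next iteration of the outer loop slices tensor_sizes with the wrong length. The sketch below is a stand-in for the test, not the test itself. It drops torch and Horovod, follows only the first dimension, and iterates over dims for a single dtype. The function name rows_per_iteration is made up.

```python
def rows_per_iteration(world_size, num_dims=3):
    """Mimic the loop structure of the test, tracking first-dimension sizes.

    Returns, per outer iteration, a pair
    (rows an allgather over `world_size` ranks would produce,
     rows in the hand-built gradient).
    """
    size = world_size
    result = []
    for _dim in range(1, num_dims + 1):
        tensor_sizes = ([3, 2, 7, 4, 6, 8, 10] * 5)[:size]
        # Each real rank contributes tensor_sizes[rank] rows.
        gathered_rows = sum(tensor_sizes[:world_size])
        grad_rows = 0
        for _r, size in enumerate(tensor_sizes):  # rebinds `size`, as in the test
            grad_rows += size
        result.append((gathered_rows, grad_rows))
    return result


single = rows_per_iteration(1)
double = rows_per_iteration(2)
```

With one process the second iteration gives (3, 12), which lines up with "got [12, 17] but expected shape compatible with [3, 17]". With two processes size happens to be rebound to 2 again, so every iteration gives (5, 5), and that would explain why this test passes under mpirun -np 2.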