Hello, I noticed that espnet's PyTorch backend uses data_parallel for multi-GPU support, but as far as I understand, this only works on a single machine. Is there any plan to support multi-machine, multi-GPU ASR training? That would make training on 10k+ hours of data feasible in industry. Thanks.
We have been using multi-machine training internally in several projects, and it works with some minor changes, but I want to move in this direction once we have settled on a new training abstraction design (e.g., #1372). We're considering a (super) major update for this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue is closed. Please re-open if needed.