Deepspeech: Moving away from TaskCluster

Created on 9 Sep 2020 · 18 Comments · Source: mozilla/DeepSpeech

TaskCluster is a CI service provided by Mozilla, available both to Firefox development (the Firefox-CI instance) and to the community on GitHub (Community TaskCluster). It is widely used across Mozilla projects and has its own advantages. In our case, control over tasks, dedicated workers for specific needs, and long build times were easier to achieve by working with the TaskCluster team than by relying on other CI services.

However, this has led to CI code that is very specific to the project, and to frustration for non-employees trying to send patches and get involved: some parts of the CI were “hand-crafted”, and triggering builds and tests requires being a “collaborator” on the GitHub project, which has other implications that make it complicated to grant to just anyone. In the end, this creates an artificial barrier to contributing. Even though we happily run PRs manually, it is still frustrating for everyone. Issue https://github.com/mozilla/DeepSpeech/issues/3228 was an attempt to fix that, but we concluded it would be more beneficial for everyone to switch to a well-known CI service and a setup that is less intimidating. While TaskCluster is a great tool and has helped us a lot, we feel its limitations now make it inappropriate for a project that wants to stimulate and enable external contributions.

We would like to take this opportunity to also enable more contributors to hack and own the code related to CI, so discussion is open.

ci help wanted

Most helpful comment

Our current usage of TaskCluster:

We leverage the current features:

  • building a graph of tasks with dependencies: https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/tc-decision.py
  • artifact with indexes: https://community-tc.services.mozilla.com/tasks/index/project.deepspeech
  • building multiple archs:

    • linux/amd64 (via docker-worker)

    • linux/aarch64 (cross-compilation, docker-worker)

    • linux/rpi3 (cross-compilation, docker-worker)

    • android/armv7 (cross-compilation, docker-worker)

    • android/aarch64 (cross-compilation, docker-worker)

    • macOS/amd64 (native, generic-worker, deepspeech-specific hardware deployment)

    • iOS/x86_64 (native, reusing the macOS infra)

    • iOS/aarch64 (native, reusing the macOS infra)

    • Windows/amd64 (native, generic-worker, deepspeech pool managed by taskcluster team)

  • testing on multiple archs:

    • linux/amd64 (docker-worker)

    • linux/aarch64 (native, deepspeech specific hardware, docker-worker)

    • linux/rpi3 (native, deepspeech specific hardware, docker-worker)

    • android/armv7 (docker-worker + nested virt)

    • android/aarch64 (docker-worker + nested virt)

    • macOS/amd64 (native, deepspeech specific hardware deployment, generic-worker)

    • iOS/x86_64 (native, reusing macOS infra)

    • Windows/amd64 (native, generic-worker, deepspeech pool managed by taskcluster team)

    • Windows/CUDA (native, generic-worker with NVIDIA GPU, deepspeech pool managed by taskcluster team)

  • Documentation on ReadTheDocs + GitHub webhook to generate on PR/push/tag
  • Pushing to repos:

    • Docker Hub via CircleCI

    • Everything else via scriptworker instance running on Heroku:

      • NPM

      • PyPI

      • NuGet

      • JCenter

      • GitHub

Hardware:

  • Set of GCP VMs for Linux+Android builds/tests
  • Set of AWS VMs for Windows builds/tests
  • 4x MacBook Pro for macOS setups, with VMware Fusion and sets of build/test VMs configured
  • ARM hardware self-hosted:

    • 6x LePotato boards for Linux/Aarch64 tests

    • 6x RPi3 boards for Linux/ARMv7 tests

tc-decision.py is in charge of building the whole graph of tasks describing a PR or a Push/Tag:

  • PRs run tests
  • Push runs builds
  • Tag runs builds + uploads to repositories
  • YAML description files in taskcluster/*.yml to describe tasks
  • dependencies between tasks based on .yml filename (without .yml)
  • decision task created by .taskcluster.yml (canonical entry point of the TaskCluster / GitHub integration) + taskcluster/tc-schedule.sh
  • https://community-tc.services.mozilla.com/docs
  • LC_ALL=C GITHUB_EVENT="pull_request.synchronize" TASK_ID="aa" GITHUB_HEAD_BRANCHORTAG="branchName" GITHUB_HEAD_REF="refs/heads/branchName" GITHUB_HEAD_BRANCH="branchName" GITHUB_HEAD_REPO_URL="aa" GITHUB_HEAD_SHA="a" GITHUB_HEAD_USER="a" GITHUB_HEAD_USER_EMAIL="a" python3 taskcluster/tc-decision.py --dry
  • LC_ALL=C GITHUB_EVENT="tag" TASK_ID="aa" GITHUB_HEAD_BRANCHORTAG="branchName" GITHUB_HEAD_REF="refs/heads/branchName" GITHUB_HEAD_BRANCH="branchName" GITHUB_HEAD_REPO_URL="aa" GITHUB_HEAD_SHA="a" GITHUB_HEAD_USER="a" GITHUB_HEAD_USER_EMAIL="a" python3 taskcluster/tc-decision.py --dry
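
To illustrate how dependencies can be derived from file names, here is a minimal sketch. It is not the actual tc-decision.py. It assumes each task description has already been loaded into a dict, with a hypothetical `dependencies` key listing other tasks by file name without the `.yml` suffix; the task names in the example are made up. Ordering uses the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def task_name(filename):
    """Strip the .yml suffix; the remaining name identifies the task."""
    return filename[:-len(".yml")] if filename.endswith(".yml") else filename


def schedule(tasks):
    """Return task names ordered so that dependencies come first.

    `tasks` maps a file name to a dict whose hypothetical "dependencies"
    key lists other tasks by file name without the .yml suffix.
    """
    graph = {}
    for filename, description in tasks.items():
        graph[task_name(filename)] = set(description.get("dependencies", []))

    # Fail early if a task depends on a file that does not exist.
    missing = {dep for deps in graph.values() for dep in deps} - set(graph)
    if missing:
        raise ValueError(f"unknown dependencies: {sorted(missing)}")

    return list(TopologicalSorter(graph).static_order())


# Hypothetical task files; the real ones live in taskcluster/*.yml.
example = {
    "test-linux.yml": {"dependencies": ["build-linux"]},
    "build-linux.yml": {"dependencies": ["tf-linux"]},
    "tf-linux.yml": {},
}
order = schedule(example)
print(order)
```

Any CI service that lets one job declare which jobs it needs can express the same graph; the sketch only shows that the file name alone is enough to act as the task identifier.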

Execution encapsulated within bash scripts:

  • Only bash for ease of hacking
  • Re-usable across all platforms (Linux, macOS, Windows) whereas Docker would cover only Linux
  • TensorFlow build:

    • tf_tc-setup.sh: perform setup steps for TensorFlow builds (install Bazel, CUDA, etc.)

    • tf_tc-build.sh: perform build of TensorFlow

    • tf_tc-package.sh: package the TensorFlow build dir as home.tar.xz for re-use

    • exact re-use of tensorflow is required for Bazel to properly re-use its caching

  • DeepSpeech build

    • same architecture, spread over:

      • taskcluster/tc-all-utils.sh

      • taskcluster/tc-all-vars.sh

      • taskcluster/tc-android-utils.sh

      • taskcluster/tc-asserts.sh

      • taskcluster/tc-build-utils.sh

      • taskcluster/tc-dotnet-utils.sh

      • taskcluster/tc-node-utils.sh

      • taskcluster/tc-package.sh

      • taskcluster/tc-py-utils.sh
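
The packaging step listed for the TensorFlow build (the build dir saved as home.tar.xz for re-use) can be sketched as below. The real step is a bash script; this is a Python stand-in using the standard `tarfile` module, and the function names `package_build_dir` and `restore_build_dir` are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def package_build_dir(build_dir, archive_path):
    """Pack build_dir into an xz-compressed tarball and return its path."""
    with tarfile.open(archive_path, "w:xz") as tar:
        # arcname="." stores relative paths, so the archive can be
        # unpacked under whatever directory the next task chooses.
        tar.add(Path(build_dir), arcname=".")
    return Path(archive_path)


def restore_build_dir(archive_path, target_dir):
    """Unpack a previously packaged build dir into target_dir."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:xz") as tar:
        tar.extractall(target_dir)


# Round trip on a throwaway directory standing in for the build dir.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp, "home")
    home.mkdir()
    (home / "marker.txt").write_text("bazel cache placeholder")

    archive = package_build_dir(home, Path(tmp, "home.tar.xz"))
    restored = Path(tmp, "restored")
    restore_build_dir(archive, restored)
    roundtrip_ok = (restored / "marker.txt").read_text() == "bazel cache placeholder"

print(roundtrip_ok)
```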

All 18 comments

What do you think about GitLab's built-in CI features?

I'm using it for my Jaco-Assistant project and I'm quite happy with it because it currently supports almost all my requirements. The pipeline runs linting checks and computes some code statistics, and I'm using it to provide prebuilt container images (you could build and provide the training images from there, for example). See my CI setup file here.

There is also an official tutorial for usage with GitHub: https://about.gitlab.com/solutions/github/
And it's free for open source projects.

What do you think about GitLab's built-in CI features?

That would mean moving to GitLab, which raises other questions. I don't have experience with their CI, even though I use GitLab for some personal projects (from gitorious.org).

Maybe I should post a detailed explanation of our usage of TaskCluster to help there?

That would mean moving to GitLab

No, you can use it with GitHub too.


From: https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/

Instead of moving your entire project to GitLab, you can connect your external repository to get the benefits of GitLab CI/CD.

Connecting an external repository will set up repository mirroring and create a lightweight project with issues, merge requests, wiki, and snippets disabled. These features can be re-enabled later.

To connect to an external repository:

    1. From your GitLab dashboard, click New project.
    2. Switch to the CI/CD for external repo tab.
    3. Choose GitHub or Repo by URL.
    4. The next steps are similar to the import flow.


Maybe I should post a detailed explanation of our usage of TaskCluster to help there?

I think this is a good idea. But you should be able to do everything on GitLab CI as long as you can run it in a Docker container without special flags.

in a docker container

We also need support for Windows, macOS and iOS, which cannot be covered by Docker.

I have been using GitLab CI (the on-prem community edition) for about three years at my workplace, and so far I have been very happy with it. @lissyx I believe GitLab CI supports all the requirements you listed above - I've personally used most of those features.

The thing I really like about GitLab CI is that it seems to be a very important feature for the company - they release updates frequently.

@lissyx I believe GitLab CI supports all the requirements you listed above - I've personally used most of those features.

Don't hesitate if you want, I'd be happy to see how you can do macOS or Windows builds / tests.

Windows builds might be covered with some of their beta features:
https://about.gitlab.com/blog/2020/01/21/windows-shared-runner-beta/

For iOS I think you would need to create your own runners on the MacBooks and link them to the CI. They made a blog post for this:
https://about.gitlab.com/blog/2016/03/10/setting-up-gitlab-ci-for-ios-projects/

I have no time to take a look at that, sadly.

@DanBmh @opensorceror Let me be super-clear: what you shared looks very interesting, but I have no time to dig into that myself. If you guys are willing, please go ahead. One thing I should add is that for macOS, we would really need something hosted: the biggest pain was maintaining this. If we move to GitLab CI but there is still a need to babysit those machines, it's not really worth the effort.

Personally I'm a bit hesitant to work on this by myself, because the CI config of this repo seems too complex for a lone newcomer to tackle.

FWIW, I did a test connecting a GitHub repo with GitLab CI...works pretty well.

I'm not sure where we would find _hosted_ macOS options though.

Personally I'm a bit hesitant to work on this by myself, because the CI config of this repo seems too complex for a lone newcomer to tackle.

Of course

FWIW, I did a test connecting a GitHub repo with GitLab CI...works pretty well.

That's nice, I will have a look.

I'm not sure where we would find _hosted_ macOS options though.

That might be the biggest pain point.

Looks like Travis supports macOS builds.

Never used it though, not aware of the limitations if any.

FWIW, I did a test connecting a GitHub repo with GitLab CI...works pretty well.

Can it do something like we do with TC, i.e., precompile bits and fetch them at need?
This is super-important, because when you have to rebuild TensorFlow with CUDA, we're talking about hours even on decent systems.

So to overcome this, we have https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/generic_tc_caching-linux-opt-base.tyml + e.g., https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/tf_linux-amd64-cpu-opt.yml

It basically:

  • does a setup + bazel build step on TensorFlow with the parameters we need
  • produces a tar we can re-use later
  • stores it on the TaskCluster index infrastructure

This gives us caching that we can update periodically, as you can see here: https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/.shared.yml#L186-L260

We use the same mechanisms for many components (SWIG, pyenv, homebrew, etc.) to make sure we can keep build time decent on PRs (~10-20min of build more or less, ~2min for tests) so that a PR can complete under 30-60 mins.
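
The caching flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not TaskCluster's API: the `ArtifactIndex` class, the `cache_key` function, and the `project.example` namespace format are all hypothetical, and the index is an in-memory dict standing in for a real index service.

```python
import hashlib


def cache_key(component, parameters):
    """Derive a stable index namespace from a component and its build parameters."""
    canonical = "|".join(f"{k}={parameters[k]}" for k in sorted(parameters))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"project.example.{component}.{digest}"  # hypothetical namespace format


class ArtifactIndex:
    """In-memory stand-in for an artifact index service."""

    def __init__(self):
        self._entries = {}

    def find(self, namespace):
        return self._entries.get(namespace)

    def insert(self, namespace, artifact):
        self._entries[namespace] = artifact


def fetch_or_build(index, component, parameters, build):
    """Return the indexed artifact if present; otherwise build and index it."""
    namespace = cache_key(component, parameters)
    artifact = index.find(namespace)
    if artifact is None:
        # Slow path: only taken when nothing matches these parameters.
        artifact = build(parameters)
        index.insert(namespace, artifact)
    return artifact


builds = []


def fake_build(parameters):
    """Placeholder for the expensive setup + build + package steps."""
    builds.append(parameters)
    return f"home-{len(builds)}.tar.xz"


index = ArtifactIndex()
params = {"arch": "linux-amd64", "cuda": "yes"}
first = fetch_or_build(index, "tensorflow", params, fake_build)
second = fetch_or_build(index, "tensorflow", params, fake_build)
print(first == second, len(builds))
```

Keying the entry on the build parameters means that changing any of them produces a new namespace and forces a rebuild, while an unchanged configuration is served from the index.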

That would be possible; these are also called _artifacts_ in GitLab. You should be able to run the job periodically, or only when certain files in the repo have changed.

I'm doing something similar here, saving the following image, which I later use in my readme.

Nice, and can those be indexed like what TaskCluster has?

Can those be indexed like what TaskCluster has?

Not sure what you mean by this. You can give them custom names or save folders depending on your branch names for example, if this is what you mean.

Ok, I think I will try and use GitLab CI on gitlab for a pet-project of mine that lacks CI :), that will help me get a grasp of the landscape.
