5x Speedup on CICD via Github Actions' Strategy.Matrix
Background
No one likes to wait 40 minutes for a pull request's tests to finish. You may make a coffee, flirt with your cat, or read memes the first few times, but I’m pretty sure your work ethic will eventually catch up with you: this isn’t right, and we need to optimize it now!
At Jina AI, we were facing this issue. As an open-source company, we keep all our code infrastructure on Github and rely heavily on Github Actions (we have even published two Actions in the marketplace). Our CICD workflows for jina-ai/jina consist of 15 files totaling almost 1 KLOC. We employ pytest and docker for conducting unit, integration, and regression tests. Since Feb. 2020, the community and our team have continuously added tests while developing Jina, and today we have reached 85% code coverage on 13 KLOC of Python code. However, this comes at a cost: on every commit to a pull request, we have to wait 40 minutes for its test result. It greatly slows down our development speed and prolongs the community’s feedback cycle. As a fast-growing OSS startup, this is unacceptable to us.
This post will show you how to use strategy.matrix and Github Packages to significantly reduce the time spent on Github workflows. For us, these tricks cut our testing time from 40 minutes to 8 minutes! You can find the complete script in our repository here.
Table of Contents
- Understanding the Test Structure
- Building Job Matrix for Parallelization
- Improvement on Testbed: Docker Build
- Summary
Jina is an easier way for enterprises and developers to build cross-/multi-modal neural search systems on the cloud. You can use Jina to build a text/image/video/audio search system in minutes. Give it a try.
Understanding the Test Structure
Initially, our Github workflow ci.yml looks like the following:
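The original file is long; in essence, it defines two top-level test jobs that run side by side. Below is a simplified sketch, not the original listing (the setup steps are illustrative):

```yaml
name: CI

on: [pull_request]

jobs:
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: 3.7
      # the whole unit-test folder runs inside one job
      - run: pytest tests/unit

  integration-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: 3.7
      # the whole integration-test folder runs inside one job
      - run: pytest tests/integration
```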
One can observe that the workflow already runs the unit and integration tests in parallel. Clearly, this is not enough; we need further parallelization at a finer granularity. But first, let’s look at the structure of our tests folder:
Folder | Purpose | Structure | Size |
---|---|---|---|
tests/unit | Unit test on each module | Sub-folders are organized by modules’ structure | Big |
tests/integration | Integration & regression test on each module | Sub-folders are organized by modules’ structure/Github issues number/test scenarios | Big |
tests/distributed | Integration test on distributed environment using docker-compose | Sub-folders are organized by network topologies | Medium |
tests/jinad | Unit test on Jina Daemon | Sub-folders are organized by modules’ structure | Small |
tests/jinahub | Integration test on Jina Hub | No sub-folder | Small |
Given that all tests listed above are written in pytest, and they are unevenly distributed over different folders, one may think of leveraging pytest plugins (e.g. pytest-parallel, pytest-xdist) to parallelize them per test case. However, this method does not leverage the scalable infrastructure that Github workflows offer: you are still running everything on one machine. Besides, we also prefer a controllable and clean environment on every run to ensure zero side effects between tests. Zero side effects are vital to Jina, as Jina is a decentralized system that creates & destroys ports/sockets/workspaces while running. Too many tests running in parallel in the same namespace may introduce flakiness into the CI.
Building Job Matrix for Parallelization
The strategy.matrix is a powerful syntax in Github workflows. It allows you to create multiple jobs by performing variable substitution in a single job definition. For example, you can use a matrix to create jobs for Python 3.7, 3.8 & 3.9, as sketched below. Github Actions will reuse the job’s configuration and create three jobs running in parallel. A job matrix can currently generate a maximum of 256 jobs per workflow run, which is far more than enough in our case.
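A minimal sketch of such a version matrix (the job body is illustrative):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.7', '3.8', '3.9']
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      # the same steps run once per Python version, in parallel
      - run: pytest tests
```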
In a nutshell, we want to fill in strategy.matrix with all of our test paths, as sketched below.
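A minimal sketch of such a hardcoded matrix in the core-test job (the paths mirror our tests folder; the remaining steps are elided):

```yaml
core-test:
  runs-on: ubuntu-latest
  strategy:
    matrix:
      test-path:
        - tests/unit/
        - tests/integration/
        - tests/distributed/
        - tests/jinad/
        - tests/jinahub/
  steps:
    - uses: actions/checkout@v2
    # each matrix entry becomes its own parallel job
    - run: pytest ${{ matrix.test-path }}
```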
Of course, no one wants to hardcode these test paths into the workflow YAML; we need a way to dynamically fill the values into strategy.matrix.test-path. There are two problems here:
How to get those test paths?

This can be done by listing all sub-folders and orphan tests in a simple shell script:

```bash
# get orphan tests under the 1st-level folders
declare -a array1=( "tests/unit/*.py" "tests/integration/*.py" "tests/distributed/*.py" )
# get all 2nd-level sub-folders, excluding the Python cache
declare -a array2=( $(ls -d tests/{unit,integration,distributed}/*/ | grep -v '__pycache__' ) )
# combine the two into one array
dest=( "${array1[@]}" "${array2[@]}" )
```

How to pass the value to strategy.matrix.test-path?
Thanks to this feature introduced in April 2020, one can now use fromJSON to take a stringified JSON object and bind it to a property. Combining this with job.outputs, we can build a workflow that has a fully dynamic matrix. We need to adapt the shell script above and convert the shell array into stringified JSON using our good old friend jq, a command-line JSON processor:

```bash
printf '%s\n' "${dest[@]}" | jq -R . | jq -cs .
```
This will give you something like the following:

```json
["tests/unit/*.py","tests/integration/*.py","tests/distributed/*.py","tests/distributed/test_against_external_daemon/","tests/distributed/test_index_query/","tests/distributed/test_index_query_with_shards/",...]
```
Putting everything together, we first prepare all test paths in a separate job, say prep-testbed, and then in the testing job we load the paths from the output of prep-testbed into strategy.matrix. Consequently, our new workflow YAML can be written as follows:
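The real file also contains other jobs (e.g. commit-lint); below is a condensed sketch of just the dynamic-matrix part, with setup steps simplified:

```yaml
prep-testbed:
  runs-on: ubuntu-latest
  outputs:
    matrix: ${{ steps.set-matrix.outputs.matrix }}
  steps:
    - uses: actions/checkout@v2
    - id: set-matrix
      run: |
        # same collection logic as the shell script above
        declare -a array1=( "tests/unit/*.py" "tests/integration/*.py" "tests/distributed/*.py" )
        declare -a array2=( $(ls -d tests/{unit,integration,distributed}/*/ | grep -v '__pycache__' ) )
        dest=( "${array1[@]}" "${array2[@]}" )
        # expose the stringified JSON array as a job output
        echo "::set-output name=matrix::$(printf '%s\n' "${dest[@]}" | jq -R . | jq -cs .)"

core-test:
  needs: prep-testbed
  runs-on: ubuntu-latest
  strategy:
    matrix:
      # fromJSON turns the stringified array back into a real list
      test-path: ${{ fromJSON(needs.prep-testbed.outputs.matrix) }}
  steps:
    - uses: actions/checkout@v2
    - run: pytest ${{ matrix.test-path }}
```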
Note the job dependency of core-test on prep-testbed, specified by the needs keyword. Once you commit the new workflow, Github will start creating many parallel jobs, one per test path.
Improvement on Testbed: Docker Build
One immediate benefit of the parallel workflow is that you can quickly locate the first failed test without running all tests. If you still prefer an exhaustive run, add fail-fast: false to strategy.
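For example, a minimal sketch (the matrix value assumes the dynamic setup from above):

```yaml
core-test:
  strategy:
    # keep running the remaining test groups even if one group fails
    fail-fast: false
    matrix:
      test-path: ${{ fromJSON(needs.prep-testbed.outputs.matrix) }}
```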
Docker Push to Github Packages
Some of our tests require a pre-built Jina Docker image running as a detached container in the background. One can easily add this step to the prep-testbed job. The following YAML config builds the image from the current head and then uploads it to Github Packages. The core-test job then pulls this image before conducting any test. Here we choose Github Packages over Docker Hub to leverage the faster delivery network inside Github:
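A sketch of those steps, assuming the Github Packages Docker registry at docker.pkg.github.com; the image path and login details are illustrative:

```yaml
prep-testbed:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - name: Build and push the test image
      run: |
        # log in to Github Packages with the workflow's built-in token
        echo "${{ secrets.GITHUB_TOKEN }}" | docker login docker.pkg.github.com -u ${{ github.actor }} --password-stdin
        # tag the image with the unique run id so concurrent PRs don't clash
        docker build -t docker.pkg.github.com/jina-ai/jina/jina:test-$GITHUB_RUN_ID .
        docker push docker.pkg.github.com/jina-ai/jina/jina:test-$GITHUB_RUN_ID
```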
Note that we name the image with the unique environment variable env.GITHUB_RUN_ID at build time. This ensures that tests from different PRs use their own corresponding Docker images. In the core-test job, after pulling the image, we immediately re-tag it as jinaai/jina:test-pip before conducting any test. Hence, all test code can refer to the image as jinaai/jina:test-pip without specifying this unique ID.
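A sketch of the corresponding pull-and-re-tag step in core-test (same illustrative image path as above):

```yaml
core-test:
  needs: prep-testbed
  steps:
    - name: Pull and re-tag the test image
      run: |
        echo "${{ secrets.GITHUB_TOKEN }}" | docker login docker.pkg.github.com -u ${{ github.actor }} --password-stdin
        docker pull docker.pkg.github.com/jina-ai/jina/jina:test-$GITHUB_RUN_ID
        # give the image a stable name so test code never needs the run id
        docker tag docker.pkg.github.com/jina-ai/jina/jina:test-$GITHUB_RUN_ID jinaai/jina:test-pip
```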
Code Coverage
Careful readers may worry about code coverage when using a parallel workflow: if each job only runs a small subset of the tests, how can the code coverage be computed correctly?
Fortunately, Codecov does not override report data for multiple uploads; it always merges the report data. It is common to test multiple build systems, split up tests in different containers, and group tests based on test focus.
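A sketch of the relevant steps at the end of core-test (assuming pytest-cov produces coverage.xml; the action version is illustrative):

```yaml
core-test:
  # ...dynamic matrix setup as above...
  steps:
    - uses: actions/checkout@v2
    # produce a coverage.xml for this test group only
    - run: pytest --cov=jina --cov-report=xml ${{ matrix.test-path }}
    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v1
      with:
        file: coverage.xml
```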
The workflow above uploads the code coverage report after each test group is done. As some test groups finish earlier whereas others are slower, you will see the coverage report update incrementally on the PR.
Summary
In this post, I have demonstrated how to use strategy.matrix to speed up a Github workflow. Optimizations like this rank among the highest priorities at Jina AI: getting them done improves the whole engineering team’s efficiency, saving hundreds of minutes per day.
Note that in a parallel workflow like this, the overall test time equals the slowest group’s test time. This means that if you put everything into one folder, making its workload significantly bigger than the others’, the benefit of parallelization is marginalized. An uneven test workload is inevitable in practice, however, and simple automation can only take us so far. Therefore, some best practices need to be followed:
- spread your test workload evenly across folders;
- for a big test folder, split it into sub-folders based on test focus.
If you’d like to explore more ML/AIOps techniques in practice, you are welcome to join our monthly Engineering All Hands via Zoom or Youtube live stream. Previous meeting recordings can be found on our Youtube channel. If you like Jina and want to join us as a full-time AI / Backend / Frontend developer, please submit your CV to our job portal. Let’s build the next neural search ecosystem together!