Virtual-environments: GitHub Actions randomly fail with "Error: No space left on device"

Created on 20 Oct 2020 · 9 Comments · Source: actions/virtual-environments

Description
GitHub Actions randomly fail with "Error: No space left on device"

Area for Triage:

Bug:

Virtual environments affected

  • [ ] macOS 10.15
  • [ ] Ubuntu 16.04 LTS
  • [x] Ubuntu 18.04 LTS
  • [ ] Ubuntu 20.04 LTS
  • [ ] Windows Server 2016 R2
  • [ ] Windows Server 2019

Expected behavior
The jobs should pass.

Actual behavior
Jobs randomly fail despite normal disk space usage.
I added sudo df -h after pulling a Docker image but before building the project.
Once the project is built, the build directory takes up another ~2.5GB.

Filesystem      Size  Used Avail Use% Mounted on
udev            3.4G     0  3.4G   0% /dev
tmpfs           696M  680K  695M   1% /run
/dev/sda1        84G   65G   19G  78% /
tmpfs           3.4G  8.0K  3.4G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
/dev/sda15      105M  3.6M  101M   4% /boot/efi
/dev/sdb1        14G  4.1G  9.0G  32% /mnt

As you can see, the disk space my job uses is less than the 14GB available to the runners.
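A guard like the following could make a low-disk condition visible before the build starts, instead of surfacing later as "No space left on device". This is only a sketch: the function names avail_gib and require_free_gib are hypothetical, and the threshold passed to the guard is an assumption that would need tuning for the actual build.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the free space on a mount point in whole GiB.
avail_gib() {
  # -P forces one POSIX-format line per filesystem, -k reports 1 KiB blocks.
  # Column 4 of the second line is the available space.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Hypothetical guard: succeed only if at least $2 GiB are free on $1.
require_free_gib() {
  local free
  free="$(avail_gib "$1")"
  if [ "$free" -lt "$2" ]; then
    echo "Only ${free}G free on $1, need at least ${2}G" >&2
    return 1
  fi
}
```

In a workflow step this might be called as require_free_gib / 10 right before the build command, where 10 is an illustrative value rather than a measured requirement.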

Note:

  1. Usually, after the job is re-triggered the problem does not occur.

Repro steps

  1. The link to the failing workflow: https://github.com/vmware/concord-bft/blob/master/.github/workflows/build_and_test.yml
  2. Unfortunately, the failing job was accidentally restarted so I don't have the logs.
Labels: Image administration, Ubuntu, investigate

All 9 comments

Hi @f-squirrel!
As far as I can see from your snippet, there is 19GB free on /. Is that not enough for your needs? Please note that /mnt is used for swap, so it has less than 14GB available.

@miketimofeev, 19GB is more than enough for me, because the only additional space I need is 2.4GB for the build artifacts.
I tried to print the disk space usage after the problem happens, but it did not work because the runner itself fails.
Could you advise how to debug this issue?
Another question: where does the 65GB of usage come from? I do not install anything on the machine except one 1.64GB Docker image.

@f-squirrel let me answer from the end: the 65GB is used by Ubuntu itself plus a huge list of preinstalled software, which can be found here: https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu1804-README.md
Could you please share your workflow and failed run so we can check what else could consume all the remaining disk space?
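To see what is actually consuming the disk on a runner, one option is to list the largest entries under a directory at a few points in the job. The sketch below assumes a du that supports -d (GNU and BSD both do); the function name largest_dirs is hypothetical and not part of any runner tooling.

```shell
#!/usr/bin/env bash

# Hypothetical helper: list the N largest entries under a directory, in KiB.
# The first line of output is the total for the directory itself.
largest_dirs() {
  local dir="$1"
  local count="${2:-10}"
  # -x stays on one filesystem, -k reports KiB, -d 1 limits the report
  # to the directory and its direct children.
  du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, largest_dirs "$GITHUB_WORKSPACE" 5 after the build would show the workspace total followed by its four largest subdirectories.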

@f-squirrel actually the build consumes more than 2.5GB. I forked the repo and changed the step to output disk space every 15 seconds:

        - name: Build and test
          run: |
              (while true; do 
              df -h
              sleep 15
              done) &
              script -q -e -c "make pull"
              sudo df -h
              script -q -e -c "make build \
                              ${{ matrix.compiler}} \
                              CONCORD_BFT_CMAKE_FLAGS=\"\
                              ${{ matrix.ci_build_type }} \
                              -DBUILD_TESTING=ON \
                              -DBUILD_COMM_TCP_PLAIN=FALSE \
                              -DBUILD_COMM_TCP_TLS=FALSE \
                              -DCMAKE_CXX_FLAGS_RELEASE=-O3 -g \
                              -DUSE_LOG4CPP=TRUE \
                              -DBUILD_ROCKSDB_STORAGE=TRUE \
                              ${{ matrix.use_s3_obj_store }} \
                              -DUSE_OPENTRACING=ON \
                              -DOMIT_TEST_OUTPUT=OFF\
                              -DKEEP_APOLLO_LOGS=TRUE\" "\
              && script -q -e -c "make test"

At the end of the build, I had 12GB free.

Filesystem      Size  Used Avail Use% Mounted on
udev            3.4G     0  3.4G   0% /dev
tmpfs           696M  756K  695M   1% /run
/dev/sda1        84G   73G   12G  87% /
tmpfs           3.4G  8.0K  3.4G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
/dev/sda15      105M  3.6M  101M   4% /boot/efi
/dev/sdb1        14G  4.1G  9.0G  32% /mnt
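The background loop in the step above keeps running after the build command returns. A possible variation, sketched here under the assumption that the step runs in bash, wraps the monitor in a function that stops it once the wrapped command finishes and passes the command's exit status through. The name run_with_disk_monitor is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: run a command while printing disk usage for /
# every $1 seconds, then stop the monitor and return the command's status.
run_with_disk_monitor() {
  local interval="$1"
  shift
  # Same idea as the loop in the comment above, restricted to /.
  ( while true; do df -h /; sleep "$interval"; done ) &
  local monitor_pid=$!
  local status=0
  # Capture the exit status without tripping `set -e`.
  "$@" || status=$?
  # Stop the monitor loop; a sleep that is already running may linger
  # for up to one interval before exiting on its own.
  kill "$monitor_pid" 2>/dev/null || true
  wait "$monitor_pid" 2>/dev/null || true
  return "$status"
}
```

With this in place, a call such as run_with_disk_monitor 15 script -q -e -c "make test" would log disk usage during the tests and still fail the step if the tests fail.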

I wonder if some core dumps were created during your failed run, because I saw this step in your YAML:

        - name: Configure core dump location
          run: |
            echo '/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
            mkdir -p ${{ github.workspace }}/artifact/cores/
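One way to test the core dump theory is to report how much space core files take up after the tests run. The sketch below assumes that cores land in a single directory and are named core.<something>, which matches the core.%e.%p pattern in the step above; the function name core_dump_usage_kib is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the total size in KiB of files named core.*
# directly inside the given directory (0 if the directory does not exist).
core_dump_usage_kib() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo 0
    return 0
  fi
  # du -k prints one "size<TAB>path" line per file; awk sums the sizes.
  find "$dir" -maxdepth 1 -type f -name 'core.*' -exec du -k {} + 2>/dev/null |
    awk '{ total += $1 } END { print total + 0 }'
}
```

Calling core_dump_usage_kib /cores in a final step that runs even on failure would show whether core files account for the missing space.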

@miketimofeev, good point!
I'll update my workflow with some additional prints, run a few tests, and get back with the results!
Thank you!

@f-squirrel did it help?

@miketimofeev, I haven't seen the issue again.
I'll continue monitoring!

@f-squirrel I'm going to close the issue for now, but please feel free to contact us if the problem still occurs.
Thank you!
