Transformers: [testing] automatically clean up temp dirs during teardown

Created on 14 Aug 2020  路  20Comments  路  Source: huggingface/transformers

Recently, a crucial fix was applied to several tests https://github.com/huggingface/transformers/pull/6453/files. the temp dir wasn't getting cleaned and subsequent tests were unreliable as they were tapping into invalid old data. The remaining issue is that the added fix is not guaranteed to run. And it repeats itself many times.

I thought of several ways to ensure the removal of the temp dir and how to make it easier to use in the tests. Here are some ideas I came up with

Idea 1

Using a simple tempfile.TemporaryDirectory context manager:

from tempfile import TemporaryDirectory
class ExamplesTests(unittest.TestCase):

    def test_run_pl_glue(self):
        with TemporaryDirectory() as tmp_dir:
            testargs = f"""
                run_pl_glue.py
                --output_dir {tmp_dir.name}
           [...]

Pros:

  • generic code
  • very localized
  • can be done multiple times in the same test and having a fresh temp dir

Cons:

  • can't pass a specific fixed dir so that it's easier to debug - could write a custom context manager that supports both random and fixed path.
  • tricky to debug - if one wants the temp dir to not be removed while developing the test - could write a custom version that supports an argument that will not remove the temp dir
  • have to reach into the object to get the actual path with obj.name

Idea 1.5

This one solves the cons of idea 1

Write a custom context manager that takes a hard-coded path and a flag to clean up or not to make it easy to debug, to be built on top of tempfile.TemporaryDirectory.

I haven't written it yet. But the core should be similar to TestsWithTempDir shown in the next idea, plus a context manager.

But here is how it would be used:

from transformers.test_utils import temp_dir_ctx
class ExamplesTests(unittest.TestCase):

    def test_run_pl_glue(self):
        with temp_dir_ctx(cleanup=True, path=None) as tmp_dir:
            testargs = f"""
                run_pl_glue.py
                --output_dir {tmp_dir.name}
           [...]

So that we could have:

Most of the time with minimal extra code, which will use a random path and auto-delete:

        with temp_dir_ctx() as tmp_dir:
            do_something_with(tmp_dir.name)

If we want a specific tmp path:

        with temp_dir_ctx(path="/use/my/path") as tmp_dir:
            do_something_with(tmp_dir.name)

if we are debugging and don't want the auto-deletion

        with temp_dir_ctx(cleanup=False) as tmp_dir:
            do_something_with(tmp_dir.name)

the only remaining cons:

  • have to reach into the object to get the actual path with obj.name - can fix with def __str__(self): return self.name

Idea 2

Solve the problem on the test class level, so that the tests don't need to do anything at all to clean up the temp dir. This solution uses unittest.TestCase's setUp/tearDown fixtures.

from pathlib import Path
import tempfile
import shutil
class TestsWithTempDir(unittest.TestCase):
    """
    This class is for tests that need to automatically remove a temp dir at the end of the test
    regardless of its success or failure.

    if no `tmp_dir` is passed a unique temp dir is created. if it's passed that passed dir is used instead.

    In either case that path is created and `self.tmp_dir` is set to the path that was used.

    Example 1: Let the system choose the path

    class ExamplesTests(TestsWithTempDir):

        def test_run_something(self):
            print(f"{self.tmp_dir} will be removed at the end of the test")

    Example 2: Use the path I supply

    class ExamplesTests(TestsWithTempDir):

        def __init__():
            super.__init__(tmp_dir="./foo/bar")

        def test_run_something(self):
            print(f"{self.tmp_dir} will be removed at the end of the test")

    """
    def __init__(self, tmp_dir=None):
        self.tmp_dir = tmp_dir
        self.tmp_dir_obj = None

    def setUp(self):
        if self.tmp_dir:
            Path(self.tmp_dir).mkdir(parents=True, exist_ok=True)
        else:
            self.tmp_dir_obj = tempfile.TemporaryDirectory()
            self.tmp_dir = self.tmp_dir_obj.name

    def tearDown(self):
        if self.tmp_dir_obj:
            del self.tmp_dir_obj
        else:
            shutil.rmtree(self.tmp_dir, ignore_errors=True)

Pros:

  • moves the cleaning up responsibility away from the test, leaving the test focused to just what it tests
  • very flexible - can handle custom and random paths
  • debug should be relatively easy - just need to add another option or a method to not tear-down (I haven't implemented it yet)

Cons:

  • only supports one tmp dir per test - won't work if multiple executions happen in the same test
  • the action is far removed from the code - could be hard to see - I'm especially concerned with running shutil.rmtree at a distance - it'd be easy to make a mistake of passing /tmp/foo instead of ./tmp/foo or worse. I'd rather not use shutil.rmtree at all unless it's right there when the developer can see what they are removing.

After contemplating these different solutions, I feel that locality is more important than behind the scenes magic, so I feel the best solution would be Idea 1.5 - i.e. a custom context manager that makes it easy to debug, to be built on top of tempfile.TemporaryDirectory, and also supports a hardcoded tmp path.

Please, let me know if any of these resonate with you and then I can code a PR that can be seen in action.

Thank you!

All 20 comments

Our @JetRunner has already gone and proposed a fix using hashes: https://github.com/huggingface/transformers/pull/6475

I think #6475 makes sense but your proposals also resonate with me, especially the 1.5. Using tempfile.TemporaryDirectory seems cleaner to me than manually removing the folder afterwards. The hardcoded paths are already set-up thanks to @JetRunner's hashes, but it does make it harder to debug as hashes are not understandable from a human point of view.

@JetRunner, would love your input on @stas00's proposals!

Thanks @stas00 and @LysandreJik!
Both idea 1 and 1.5 look good to me! Idea 2 is not flexible enough and I am worried about using the same temp dir for all test cases (if my understanding is right). Maybe idea 1 is good enough and idea 1.5 seems to be a little over complicated since people can just quickly change the directory name from tmp_dir.name to their local path for debugging and then do some cleaning themselves.
Yes, I agree temporary_dir looks much better than rmtree. rmtree looks super scary.
Also I wonder how can you trace the temp dir if the test is interrupted? Will it still be cleaned?

Thank you for looking at my suggestions!

Our @JetRunner has already gone and proposed a fix using hashes: #6475

Neat! A few notes:

  • it hasn't solved the problem of guaranteed cleanup. if the test asserts half way the clean up will not be run.
  • I like that it ends up with the same dir name for a given test all the time
  • it doesn't tell me what that output_dir is, have to take extra steps to figure it out - e.g. ls -lt
  • it's a bit too much copy-n-paste - the scaffolding is starting to dominate the test. It can be made into a function in testing_utils and there is no need to manually push --output_dir into testargs, could just use f" .... {output_dir}" into the existing list of testargs

it does make it harder to debug as hashes are not understandable from a human point of view.

I concur. Though tempfile's output is cryptic too: /tmp/tmp0vpwv7ok

Idea 2 is not flexible enough and I am worried about using the same temp dir for all test cases (if my understanding is right).

Right. That approach is problematic if you have concurrent tests running with pytest -n 2+. Good observation! It could be easily fixed though by for example using the test name as a unique string or a hash of it.

While idea 2 is super-smooth - no changes to the test! It's too far removed from where things happen from the perspective of the developer working on the test.

Maybe idea 1 is good enough and idea 1.5 seems to be a little over complicated since people can just quickly change the directory name from tmp_dir.name to their local path for debugging and then do some cleaning themselves.

You will have to comment out the with line and re-indent the rest of the code (or replace with if 1:) if you want to switch to local path, since tempfile doesn't support such override - it's not debug-needs-friendly.

Yes, I agree temporary_dir looks much better than rmtree. rmtree looks super scary.

I'm glad we both find it scary

Also I wonder how can you trace the temp dir if the test is interrupted? Will it still be cleaned?

I'm not sure what you mean by 'trace'.

It does the right thing wrt guaranteed cleanup. Testing In ipython:

import tempfile
try:
    with tempfile.TemporaryDirectory() as tmp_dir:
         print(f"{tmp_dir} will be removed at the end of the test")
         !ls -l $tmp_dir
         assert False
except: 
    pass     
finally:
    !ls -l $tmp_dir
/tmp/tmp0vpwv7ok will be removed at the end of the test
total 0
ls: cannot access '/tmp/tmp0vpwv7ok': No such file or directory



md5-0bd7f1caebabefe63bad9420fc35a1ac



- with tempfile.TemporaryDirectory() as tmp_dir:
+ if 1:
      tmp_dir="./local/path"

will do the trick. hence the idea 1.5, which will do this for you. plus let you control whether to delete or not.


One more cons of pre-creating a temp dir, regardless of how it is done is that it'll lead to not testing script's capacity to correctly create a non-existing dir for its outputs.

Yes, I agree temporary_dir looks much better than rmtree. rmtree looks super scary.

I'm glad we both find it scary

If we end up using it in a context manager I wonder whether it'd be a smart idea to protect the developer from wiping out parts of their system, by refusing to delete that dir unless it was created by the context manager - i.e. it'll assert if the dir already exists. And, of course, provide a flag i_know_what_I_am_doing_dammit=True which will bypass the baby-gate.

I don't know. This isn't great either - it will interfere with testing - I just don't like rm -r happening anywhere where I don't explicitly see it, including what it's deleting.

I am okay with all these solutions and they have their own pros and cons!

For Idea 1.5, I still think if the user (i.e., developer in this case) wants to use their own directory, we should not handle the cleaning part. On one hand, cleaning may have a risk of deleting parts of the user's file system by mistake; on the other hand, I don't think it's a good idea to make this function too complicated.

Idea 2 LGTM too as long as you solve the contradiction in directory and rmtree is considered acceptable.

If we aren't cleaning up automatically the hardcoded path, then it defeats the purpose of 1.5 completely, i.e. best then to use 1.0 - i.e. use generic tempfile.TemporaryDirectory.

So we start using:

from tempfile import TemporaryDirectory
[...]
    with TemporaryDirectory() as tmp_dir:
         print(f"{tmp_dir} will be removed at the end of the test")

and the developer working on the test and wanting a fixed path, will have to re-write this with:

from tempfile import TemporaryDirectory
[...]

   # with TemporaryDirectory() as tmp_dir:
   if 1:
       tmp_dir="./local/path"
       print(f"{tmp_dir} will be removed at the end of the test")
   import shutil
   shutil.rmtree(tmp_dir, ignore_errors=True)

That's a lot to type :(

But with 1.5 we don't have to bother to reindent, right?

You mean, as in:

        with temp_dir_ctx() as tmp_dir:
            do_something_with(tmp_dir.name)

vs:

        with temp_dir_ctx(path="/use/my/path") as tmp_dir:
            do_something_with(tmp_dir.name)

no need to reindent indeed, but it'll be very confusing as it will behave differently if path is passed (no clean up)

Moreover, if we do tell the dev to use shutil.rmtree(tmp_dir, ignore_errors=True) we are back at square one - it won't be run if assert will happen before it, so the next test run will be "contaminated".

I was thinking that in this particular situation, we actually need to wipe the dir out before the test is run. i.e. this is the real need. It's much easier to ensure it happens, because we can do it first things first, so no assert to expect.

The after test clean up is a different need.

Fair enough! I don't really have a preference here so let's go with what you think makes the most sense!

It's clear that I want to have the cake and eat it too. I want a super-safe solution, yet, with minimal coding inside the test. I think that perhaps I have to choose one or the other. I just feel uncomfortable to take responsibility for creating a gun that someone might shoot their foot with (could be mine). If I were a robot my positronic brain would have melted right now.

Haha don't worry. All these solutions are better than what we have right now (#6475)

OK, I thought of something.

We use 1.5 as originally proposed in the first comment, but in addition we require that the hardcoded path is a subdir of the cwd dir . Assert if it is not. Then in the worst case scenario something unwanted will get wiped under the cloned git dir, but the system is safe.

I agree. And we can listen to the community when the PR is done.

Idea 2.5

class ExamplesTests(TestsWithTempDir):
[...]
    def test_whatever(self):
        tmp_dir = self.remove_at_teardown("./tmp/dir")
        # code whatever, and nothing else to write, no extra indent/scope needed

This will require subclassing unittest.TestCase, to facilitate registry of one or more dirs to clean up via a new method remove_at_teardown, and the clean up of those dirs will get run automatically via its def tearDown(self)method which will do all the work (needs to be written).

This is even simpler and solves most of the deficiencies of the previous ideas.

  • we still require a sub-dir for safety, will be validated at registry time.
  • this idea drops the use of temp dir as it's not user-friendly debug wise. So we go back to hardcoded paths.
  • it's flexible, you can add several tmp dirs to remove.
  • if you want to keep the dir, just comment out the registry call

if we want to ensure the dir is clean from the get-go, we can use another method that will attempt to delete at addition time and during teardown. self.remove_now_and_at_teardown or a flag remove_at_teardown(now=True).

Thoughts?

Cool. However, it is not practical to prevent others from copying and pasting the code fragment and the same path will be a problem for parallel testing (as we discussed). In this case, I believe you can use a hash (like #6475). However, temporary dir is still a cool idea that I don't want to give up. Good to hear from @sshleifer @sgugger

I agree. I will code something that will support both, so by default we will use a unique tmp dir but for debug it'll allow for a hardcoded path. I will send a PR soonish.

Thank you for the wonderful practical feedback, @JetRunner

Thanks for taking care of it! Closing this as resolved.

Was this page helpful?
0 / 5 - 0 ratings