poetry add/update/install can fail on Windows during temp directory cleanup

Created on 12 Apr 2019  路  15Comments  路  Source: python-poetry/poetry

If something goes wrong during an installation, a file is often held open for just a little too long for TemporaryDirectory causing it to error out with errors such as:

[OSError]
[WinError 145] The directory is not empty: 'C:\\Users\\KING~1.KYL\\AppData\\Local\\Temp\\tmp774m6ge1'

...

 C:\Users\king.kyle\AppData\Local\Continuum\anaconda2\envs\py37\lib\contextlib.py in __exit__() at line 119
   next(self.gen)
 C:\Users\king.kyle\.poetry\lib\poetry\utils\helpers.py in temporary_directory() at line 35
   yield name
 C:\Users\king.kyle\AppData\Local\Continuum\anaconda2\envs\py37\lib\tempfile.py in __exit__() at line 805
   self.cleanup()
 C:\Users\king.kyle\AppData\Local\Continuum\anaconda2\envs\py37\lib\tempfile.py in cleanup() at line 809
   _shutil.rmtree(self.name)
 C:\Users\king.kyle\AppData\Local\Continuum\anaconda2\envs\py37\lib\shutil.py in rmtree() at line 507
   return _rmtree_unsafe(path, onerror)
 C:\Users\king.kyle\AppData\Local\Continuum\anaconda2\envs\py37\lib\shutil.py in _rmtree_unsafe() at line 395
   onerror(os.rmdir, path, sys.exc_info())
 C:\Users\king.kyle\AppData\Local\Continuum\anaconda2\envs\py37\lib\shutil.py in _rmtree_unsafe() at line 393
   os.rmdir(path)

add [-D|--dev] [--git GIT] [--path PATH] [-E|--extras EXTRAS] [--optional] [--python PYTHON] [--platform PLATFORM] [--allow-prereleases] [--dry-run] [--] <name> (<name>)...

This is an incredibly unhelpful error and masks all sorts of other problems.

It is causes by a bug in shutil.rmtree which can fail on Windows from time to time. It can fail for a variety of reasons, but very commonly it's simply a process holding the file active for just a little too long resulting in os.rmdir failing. This is despite the directory actually being empty when you check (i.e. it's just slow).

This is a known error in Python, of which there is no plan to fix: https://bugs.python.org/issue29982

As a result, I suggest the avoidance of using TemporaryDirectory, over which we have no control of the cleanup, and instead use mkdtemp manually, followed by a short loop to try and remove the directory.

In particular in helpers.py:

@contextmanager
def temporary_directory(*args, **kwargs):
    try:
        from tempfile import TemporaryDirectory

        with TemporaryDirectory(*args, **kwargs) as name:
            yield name
    except ImportError:
        name = tempfile.mkdtemp(*args, **kwargs)

        yield name

        shutil.rmtree(name)

should instead be something like:

@contextmanager
def temporary_directory(*args, **kwargs):
    name = tempfile.mkdtemp(*args, **kwargs)

    yield name

    robust_rmtree(name)

def robust_rmtree(path, max_retries=6):
    """Robustly tries to delete paths.

    Retries several times if an OSError occurs.
    If the final attempt fails, the Exception is propagated
    to the caller.
    """
    for i in range(max_retries - 1):
        try:
            shutil.rmtree(path)
            return # Only hits this on success
        except OSError:
            time.sleep(1)

    # Final attempt, pass any Exceptions up to caller.
    shutil.rmtree(path)

This is done in a few other projects:

The latter is where I ripped the example function.

I'm willing to write a pull request later to test this myself. But I'm writing the issue now as it is a problem that should be dealt with.

Bug

Most helpful comment

Similar code is also included in cpython itself! In particular, while they have agreed to not include this into base Python, they do still need it for their own Windows tests.

For reference: Associated bug ticket: https://bugs.python.org/issue15496 - commit referencing it: https://github.com/python/cpython/commit/6f5c5cb75b2fddf31cbb311d2a5508c54f6e21f6 (via Blame feature on GitHub)

All 15 comments

The max_retries=6 (secs) without feedbacks is too long for a failed command to me.
Please add a error message above the sleep(1)

First, I should state that the example I presented in the issue was copy pasted is not my actual pull request. It was just an example thrown up for discussion. You should reference my pull request if we're talking about the technical details of the implementation.


This isn't actually an error though, and is a fundamental method of dealing with directory deletion in Windows.

So outputting an error doesn't make sense. At most it would be a debug message. I would point out though, that this is 6 seconds until catastrophic failure. If you reach the 6 seconds you will have poetry crash with an error. So it's not like it's just a slow command, it's a slow command followed by a crash which is noticeably more acceptable in practice.

Further, I don't actually know how to output a message from this function in a way that is consistent with poetry, But it's not like there aren't other long running commands that provide very little feedback during their execution (creating a virtual environment is one of these)

The pull request is 3 attempts at 2 seconds each, but this is up for discussion. I do not know how much time is needed to get a near guarantee of directory deletion on Windows, but there is a decent chance we can run it for only 3 attempts at 1 second per.


I want to stress though. This fix or something along these lines is a requirement to claim that Poetry supports Windows, as currently in certain situations and setups it will just flat out fail as a result of this problem.


Side note: I updated the pull request to match coding standards. It was such a simple change I thought I surely couldn't mess it up. But nope: Two spaces before in-line comments...

Would you like to use this method to find out which program kept the files?

https://youtu.be/PDab6s7cWqk

P.S If this issue keeps you from trying poetry, it will keeps you from trying poetry forever.

I don't think you recognize the problem here. When you mark a file for deletion in Windows, it does not delete immediately. It is held in stasis until all references to the file clear, and then at some arbitrary point in the future it is removed.

rmtree does not wait for this to occur. Which means, from time to time, it will just arbitrarily fail. This can be simply from timing (nested directories almost always cause a problem), or from antivirus taking 0.3 seconds too long to scan the file on access.

As a result, it will error when it tries to delete the parent directory while the files within it are waiting on removal. These will be removed very quickly, within seconds, generally less than 1 second, but this errors regardless.

This is a well known problem with Windows. Many other projects have faced this issue and implement a delayed deletion to deal with it. There is a reason that Poetry gives Directory not empty errors in Windows when completely unrelated problems occur, and it's because of this delayed deletion within Windows.


There is no sensible way for me to use Unlocker to determine what is holding it open for the 100ms it's grabbed for. It could simply be the Windows file system!

This is a Windows issue. It is unrelated to my PC in particular, but the windows ecosystem on a whole. The speed of your system can be enough to make something succeed or fail. The speed of your HDD makes a difference. There is no sensible way to force or test for this, you just must know that the problem exists and deal with it.

But my point seems not that odd: https://stackoverflow.com/questions/24699854/strange-error-when-deleting-directory-from-java-0-bytes-access-denied

You made me search file descriptors without context in poetry and pip for a while...

Just to be clear. This is not a problem involving a regular program actually locking a file open for some reason. It is the result of a program responding to a change notification (specifically, the delete notification)

This is basically like having a callback in code:

def on_delete(file):
    # Do some stuff
    print(f"Do some stuff with {file.name}")

That file is in the process of being deleted. DeleteFile returned successful. It's just that some programs are grabbing information from it before this occurs. In particular, this can be the indexing service from Windows itself.

Poetry then fails from trying to delete the parent directory before this occurs. Python was told "yep, that file will be deleted.... at some point", but poetry assumes that means "right now", which means when it tires to remove the parent directory the file is not yet gone and we get the error. The file is removed mere seconds later, but by then it's too late. Poetry has crashed.


So that's why you need a retry loop. Either that or you catch the failure, ignore the error, and cleanup tmp files at a later time (because Windows doesn't do that).

Either way, something has to be done here to support this weird quirk of Windows.

@Pluckerpluck Your example made this a serious issue absolutely.
That probably conflicts with the newest Java's principle.

What's the evidence of the robustness about robust_rmtree()?

robust_rmtree() in its very nature can never guarantee removal of a directory. We have to have a timeout put in there somewhere. But it is the same system as used by:

And has been in both of those codebases for over 5 years.

As a "recent" example, rmtree() fails if a subfolder is opened in Explorer on windows due to a 0.8 millisecond wait on the directory. So in the vast majority of cases I expect this to clean up within a second.

Similar code is also included in cpython itself! In particular, while they have agreed to not include this into base Python, they do still need it for their own Windows tests.

Note: They wait for less than 1 second though, starting at 0.001 seconds and increasing exponentially up to a maximum of 1. So maybe the initial wait of 2 seconds it needlessly large, however given that this is running during poetry install and rarely actually occurs (generally at maximum on a single dependency), I think the time is OK.


With regards to testing, we used to have a single PC that consistently failed due to this error, and this change means that the system now installs requirements properly. We now have been using this fix for some time and have yet to run into any issues.

The problem is more of an issue with Windows buildbots though, where you can't just "retry" the install in the same way you would on a PC.

Please minimize the interval and the required runs of the loop.

The name of the function had better be double_rmtree() to reserve the leading "r" for rampage_rmtree().

Do you agree it's good to use the same system as CPython? Start at 1ms wait, and exponentially scale to 1 second wait (this means the total possible wait is < 2 seconds). If Python themselves do it, then it must be reasonable.

I don't understand the issue with the name roboust_rmtree() it's a much clearer name than doubel_rmtree() (which kind of sounds like you're removing something exactly twice) and I don't understand why we care about reserving the leading r. Especially given that rmtree already begins with an r and that I have no idea what rampage_rmtree() is...

If it works with your method for that reason (always), it will only need to run twice exactly.
Please rename it or close it.

There is no guarantee it will need only two runs! Python themselves run the check up to 10 times.

This is what CPython does:

  • Wait 1ms
  • Wait 2ms
  • Wait 4ms
  • Wait 8ms
  • Wait 16ms
  • Wait 32ms
  • Wait 64ms
  • Wait 128ms
  • Wait 256ms
  • Wait 512ms
  • Report failure

This results in a maximum wait of 1 second, but a minimum wait of 1ms. The idea being that you will avoid the Windows issue in the least amount of time needed.

I suggest we implement the system that Python themselves use. There is probably a good reason they choose to do it that way.

Checking only twice means we have to catch the worst case, which means waiting for 1 second every time. While rare that's just a waste of time for no good reason.


Again. Why can't this be called robust_rmtree? Or something similar that explains what it does. The name should explain the functionality (it's a more robust rmtree), not the exact implementation.

Similar code is also included in cpython itself! In particular, while they have agreed to not include this into base Python, they do still need it for their own Windows tests.

For reference: Associated bug ticket: https://bugs.python.org/issue15496 - commit referencing it: https://github.com/python/cpython/commit/6f5c5cb75b2fddf31cbb311d2a5508c54f6e21f6 (via Blame feature on GitHub)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This bug still happens, however it looks like #1032 is abandoned.

Was this page helpful?
0 / 5 - 0 ratings