Cylc-flow: Fail to create temporary file if TMPDIR does not exist

Created on 13 Mar 2019 · 15 Comments · Source: cylc/cylc-flow

A related problem to #2976.

Some HPC environments set up a unique TMPDIR for each user at login. This normally gets cleaned up on logout.

When the workflow server (i.e. suite server program) starts up and daemonizes, however, the double fork may trigger a premature clean-up of TMPDIR (suspected, without much evidence). TMPDIR is then still defined but points to a non-existent location, so any subsequent call to mkstemp and related functions (e.g. Python's tempfile.TemporaryFile) fails.

Possible solutions:

  • Always call if os.getenv('TMPDIR'): os.makedirs(os.getenv('TMPDIR'), exist_ok=True) before creating a temporary file.
  • On workflow server (i.e. suite) start up, ignore environment's TMPDIR and set up a temporary directory under the suite run directory and point TMPDIR to it.
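As a rough illustration of the two options above, here is a minimal Python sketch. The function names and the `tmp` subdirectory name are hypothetical and not taken from the Cylc code base; resetting `tempfile.tempdir` to `None` is included on the assumption that an earlier call may already have cached a location.

```python
import os
import tempfile


def ensure_tmpdir_exists():
    """Option 1: recreate $TMPDIR if it is set but the directory is gone."""
    tmpdir = os.getenv("TMPDIR")
    if tmpdir:
        os.makedirs(tmpdir, exist_ok=True)
    return tmpdir


def use_run_dir_tmp(run_dir):
    """Option 2: ignore the inherited TMPDIR and use <run_dir>/tmp instead.

    ``run_dir`` stands in for the suite run directory (an assumption here).
    """
    tmpdir = os.path.join(run_dir, "tmp")
    os.makedirs(tmpdir, exist_ok=True)
    os.environ["TMPDIR"] = tmpdir
    # Drop any location tempfile has already cached, so that the next
    # lookup consults the new TMPDIR value.
    tempfile.tempdir = None
    return tmpdir
```

Option 1 would have to run before every temporary file is created, whereas option 2 would run once at server start-up.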

@cylc/core Any preference or other suggestions?

bug non-cylc bug

All 15 comments

The bug, of course, is that this can cause job submission to fail (or, worse, leave jobs stuck in the ready state).

Wow, we sure have to deal with some esoteric problems sometimes 😬

I like your idea of a special temp dir under the suite run directory. That's simple, foolproof, and easy to monitor space requirements in the unlikely event that ever becomes a problem.

I prefer the second option

On workflow server (i.e. suite) start up, ignore environment's TMPDIR and set up a temporary directory under the suite run directory and point TMPDIR to it.

But perhaps we could add an option that lets users choose between the default TMPDIR and this other directory managed by Cylc. We can update the installation instructions (and I would review the Docker containers to see whether it's better to use this option for easy access to Cylc files in a container).

And a long shot, @matthewrmshin: is there any remote chance that the temporary directory is removed during task reload, causing tasks to get stuck in ready and contributing to #2964? Or is that too contrived a case?

Unlikely, given that we only use temporary files as STDIN to remote commands:

  • Transfer service files as a tar archive via SSH's STDIN before first job submission to a remote job host.
  • Transfer contents of job files to a remote job host via SSH's STDIN.

The reason for using a temporary file is so that the SSH command cannot get stuck on a blocked STDIN pipe. (But with #2932 we now poll the STDOUT and STDERR pipes, so perhaps we should drop temporary files from the job submission context and just feed STDIN using os.write.)
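To make the pipe-based alternative concrete, here is a hedged Python sketch. It does not use os.write directly; instead it relies on `subprocess.run(..., input=...)`, which goes through `Popen.communicate()` and services STDIN, STDOUT and STDERR together so that a large payload does not block. `run_with_stdin` and `ECHO_COMMAND` are hypothetical names, and a small Python child process stands in for the real SSH command.

```python
import subprocess
import sys


def run_with_stdin(command, payload, timeout=60):
    """Run ``command``, feeding ``payload`` (bytes) to its STDIN via a pipe.

    No temporary file is involved: the payload is written to the child
    while its STDOUT and STDERR are being read.
    """
    result = subprocess.run(
        command,
        input=payload,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in for an SSH command (assumption): a child that copies its
# STDIN to its STDOUT.
ECHO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]
```

A call such as `run_with_stdin(ECHO_COMMAND, job_file_bytes)` would then take the place of opening a temporary file and passing it as `stdin=`.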

The mktemp utility, of course, has its own convention around TMPDIR, with a default of /tmp. System administrators and users have come to rely on this behaviour. If TMPDIR is set, it is likely that the site admin wants to override the default /tmp behaviour, or that the user wants to override the site admin's default. But as we have seen in #2622, even bash got it wrong at one point - and led us all into trouble!

(And whoever said Python has batteries included?)

* On workflow server (i.e. suite) start up, ignore environment's `TMPDIR` and set up a temporary directory under the suite run directory and point `TMPDIR` to it.

Why set up a temporary directory under the suite run directory? It's unlikely this is the most appropriate location for temporary files. Can we make it configurable (noting that the most appropriate location may well be a different environment variable, e.g. $SCRATCH)?

I think the proposed approach is somewhat similar to what is being asked about by @trwhitcomb in https://groups.google.com/forum/#!topic/cylc/KoFhCGurLTo - although he was asking about a new environment variable (e.g. CYLC_TASK_TEMP_DIR) instead of overriding TMPDIR.

Why set up a temporary directory under the suite run directory? It's unlikely this is the most appropriate location for temporary files.

I don't know if that really matters if we have well-defined and limited use of temporary files (as opposed to an unpredictable free-for-all like /tmp).

If not strictly necessary, another configurable item is another thing that users need to know about...

Possible compromise idea, inspired by https://github.com/metomi/rose/pull/2297 ...

... use $HOME/cylc-run/SUITE/tmp but allow rose suite-run to symlink that (if needed) to a configurable other location.

Note the latest proposed solution means that we should rely only on in-memory temporary files. If we still need to do more, we can revisit this debate.
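For illustration only, an in-memory approach to the tar archive of service files could look like the Python sketch below. The function name and the shape of its argument (a mapping of archive member names to bytes) are assumptions, not the actual Cylc implementation.

```python
import io
import tarfile


def build_tar_in_memory(files):
    """Return a gzipped tar archive, as bytes, built entirely in memory.

    ``files`` maps archive member names to their contents (bytes).
    Nothing touches the file system, so TMPDIR is never consulted.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

The resulting bytes can be fed straight to a subprocess's STDIN rather than being spooled to disk first.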

Ah, yeah, I forgot about that!

We can also raise a bug report to Python to document the logical issue with tempfile.gettempdir. (A quick look at the logic at https://github.com/python/cpython/blob/68d228f174ceed151200e7e0b44ffc5edd4e0ea2/Lib/tempfile.py#L277 suggests that the problem is a deliberate feature - so it only needs to calculate the root location for creating temporary files once.) I am not yet able to find a similar issue on Python's bug tracker.
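The caching behaviour of tempfile.gettempdir can be reproduced with a short Python sketch. The function name and the `session-` prefix are hypothetical, and the removed directory only simulates the premature clean-up described in this issue. The sketch also relies on the documented behaviour that setting `tempfile.tempdir` back to `None` makes the module search for a directory again.

```python
import os
import shutil
import tempfile


def demonstrate_stale_tempdir():
    """Return (error_type, recovered) for a temp dir removed after caching."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    # Stand-in for the per-login TMPDIR set up by the HPC environment.
    session_dir = tempfile.mkdtemp(prefix="session-")
    try:
        os.environ["TMPDIR"] = session_dir
        tempfile.tempdir = None      # forget any earlier result
        tempfile.gettempdir()        # first lookup: the result is now cached
        shutil.rmtree(session_dir)   # simulate the premature clean-up

        error_type = None
        try:
            fd, path = tempfile.mkstemp()   # still uses the cached directory
        except OSError as exc:
            error_type = type(exc)
        else:
            os.close(fd)
            os.remove(path)

        # Clearing the cache makes tempfile search again; the missing
        # TMPDIR is skipped in favour of a usable fallback such as /tmp.
        tempfile.tempdir = None
        with tempfile.TemporaryFile() as handle:
            handle.write(b"ok")
            recovered = True
        return error_type, recovered
    finally:
        # Restore the caller's environment and tempfile state.
        tempfile.tempdir = saved_cache
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
```

The first element of the result is the exception type raised while the stale location is still cached; the second shows that temporary files work again once the cache has been cleared.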

Oh, I think it would be the first Python bug reported from an issue in Cylc?! :nerd_face: Or maybe drop a message to their chat - https://www.python.org/community/irc/
