Singularity: checkpointing with CRIU

Created on 27 Jan 2017 · 22 Comments · Source: hpcng/singularity

A new feature has been requested:

Including CRIU in singularity for a full freezing and restoring process.

This is already done in Docker, LXD, and OpenVZ. It can also be used for live migration.

This would be a great function.

Best regards,
Rémy

Enhancement Hacktoberfest help wanted

All 22 comments

I found this related CRIU mailing list discussion:
https://lists.openvz.org/pipermail/criu/2017-August/039133.html
This is about checkpointing a full container from "outside" with CRIU, which apparently has some issues as of now.

Snapshot/restore of Singularity containers via CRIU would be an awesome feature. :-)

I think there are too many things on this list https://criu.org/What_cannot_be_checkpointed#Cannot_be_dumped_.28yet.29 to make that easily do-able, but perhaps some limited portion could be checkpointed? What does a checkpoint provide over just having the singularity image itself?

Hi @vsoch ,
I fully agree making it work is not easy - CRIU is a project with a long history and is being used with Docker, LXC etc. It's far from working perfectly out of the box, but it's improving step by step.

What does a checkpoint provide over just having the singularity image itself?

The concepts are completely orthogonal.
A container image contains a runtime environment. Provided the container runtime (e.g. Singularity) is installed on a site and resources are available, this allows computing within the user-defined environment. So I'd say having the image and Singularity provides mobility of the possibility to compute.
The next step is to achieve real mobility of compute. For this, it is necessary to be able to pause / kill a running compute job, migrate it to another machine, and continue from the point where the calculation was stopped.

In VM terms, this would relate to "the VM image" (= the container image) and the possibility to take a VM-snapshot (including memory and state) to be able to perform (live or offline) migration.

Only with this capability, full mobility of compute is achieved and it becomes feasible to use short-lived opportunistic resources for long-running compute jobs.
Even on HPC farms, this would be good to have, to be able to preempt jobs to quickly free slots for jobs with high CPU count requirements.

So in short, checkpointing is orthogonal to having a container image, and a full checkpoint (nicest would be to be able to do that from outside the container) is required for full mobility of compute.

I understand the distinction, thank you for the detail! If a running process is akin to time, then pausing it is akin to controlling time. And wow, that would be impressive - I do hope we can get to some reality like that.

I fully agree with @olifre - this would give us mobility of compute in, literally, another dimension (time). Imagine starting a job on your puny laptop and then transferring it to a beefy machine when you arrive at the office. :-)

For scientific computing I see two main applications:

  • Running computations on dynamic clusters (e.g. including office machines which can join the cluster if idle and leave it again) via HTCondor and similar. So, as @olifre said, using short-lived opportunistic resources for long-running compute jobs.

    HTCondor, for example, can kill a job when the machine it's running on is used for something with higher priority, but can also restart it later (possibly on another machine) - iff a checkpointing mechanism is in place. HTCondor has become very popular in high-energy physics, for example - but a checkpointing mechanism is often lacking, because the software doesn't support it. Singularity with CRIU would be a very elegant solution - completely transparent to a whole (potentially very complex) software stack. Of course CRIU itself may fail to dump one of the many things involved, but it would be a start.

  • Guarding long-running calculations against machine failures, reboots, etc., without writing custom checkpointing code.

Hi,
How would you handle distributed applications, e.g. those using MPI? There you have the MPI state in addition to the application state. In the worst case, a transmission could be in transit on the cables/switches at the moment of checkpointing. How would that be handled?

Cheerio, Jan

Jan Wender

One option is to stop the application (in the sense of SIGSTOP), and then drain the queues on all the ranks to reach a quiescent network state. You do not deliver the drained messages to the application, but just keep them in an intermediate buffer.

Now you can checkpoint and restore the application, replay the recorded messages and let the application run further.

This is how DMTCP, and some other mechanisms, perform InfiniBand migration.

You would need some support from within the container. But OpenMPI, for example, already has plugins for checkpoint/restart.
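The stop, drain, checkpoint, restore, replay sequence described above can be illustrated with a toy model. This is only a sketch under loud assumptions: `Rank`, `Network`, `checkpoint`, `restore` and `demo` are hypothetical names invented for the illustration, the "network" is an in-memory queue, and nothing here uses MPI, DMTCP or CRIU.

```python
import copy
from collections import deque


class Rank:
    """Hypothetical stand-in for one MPI rank; its state is a running sum."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.total = 0
        self.stopped = False

    def deliver(self, value):
        # A stopped rank must never see new messages.
        if self.stopped:
            raise RuntimeError("message delivered to a stopped rank")
        self.total += value


class Network:
    """Hypothetical network: in-flight messages are (destination, value) pairs."""

    def __init__(self):
        self.in_flight = deque()

    def send(self, dest, value):
        self.in_flight.append((dest, value))

    def drain(self):
        # Take every in-flight message off the wire WITHOUT delivering it.
        recorded = list(self.in_flight)
        self.in_flight.clear()
        return recorded


def checkpoint(ranks, network):
    # 1. Stop every rank (the SIGSTOP step in the description above).
    for rank in ranks:
        rank.stopped = True
    # 2. Drain the network so that it is quiescent.
    recorded = network.drain()
    # 3. Snapshot the application state together with the recorded messages.
    return {"ranks": copy.deepcopy(ranks), "recorded": recorded}


def restore(snapshot):
    ranks = copy.deepcopy(snapshot["ranks"])
    for rank in ranks:
        rank.stopped = False
    # Replay the recorded messages to the restored ranks.
    for dest, value in snapshot["recorded"]:
        ranks[dest].deliver(value)
    return ranks


def demo():
    ranks = [Rank(0), Rank(1)]
    net = Network()
    ranks[1].deliver(5)  # already delivered before the checkpoint
    net.send(1, 7)       # still in flight at checkpoint time
    net.send(0, 3)       # still in flight at checkpoint time
    snapshot = checkpoint(ranks, net)
    return restore(snapshot), net
```

The point of the sketch is that the in-flight messages become part of the checkpoint, so the restored ranks end up in the state they would have reached without the interruption.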

--
Regards,
Maksym Planeta

Is this still a desired feature? I would love to be able to do this and I may be able to spend some time trying to implement what is required to let CRIU checkpoint and restore a singularity container, but I'm not sure where to start from.

Heya @Maaarcocr! Yeah I think there is a lot of interest in this. But I don't think anyone is working on it right now. If you want to give something a go that would be amazing!

@GodloveD I'm fairly new to singularity, what would be the best way to approach this?

Hello,

I'm also interested in this feature and would love to contribute, especially if there is some mentoring from your side, because I'm new to Singularity.

--
Regards,
Maksym Planeta

I'm also very interested in the feature, but currently do not have the capacity to work on it.
I would say it's certainly best to start with testing CRIU in four configurations, i.e. the following two setups, each with and without user namespaces:

  1. Run CRIU inside of Singularity, checkpointing the full process tree inside.
  2. Run Singularity inside of CRIU, i.e. try to checkpoint the full Singularity process tree.

I would test these two setups first with user namespaces, i.e. non-setuid, on a modern kernel, since CRIU should officially support that, and then in setuid mode, hence "4 configurations".
I expect 1. is easier to get to work, since CRIU should not have to do anything special for Singularity at all.
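The test matrix above could be written down roughly as follows. This is a sketch, not a tested recipe: the image name `test.sif`, the directory `/tmp/ckpt`, the `sleep 1000` workload and the `<singularity-pid>` placeholder are all made up for illustration, and it assumes a Singularity version whose `exec` accepts `--userns` and a CRIU whose `dump` accepts `--tree`, `--images-dir` and `--shell-job`. The script only builds command lines and never runs them; whether any of them actually succeeds is exactly what the test would have to find out.

```python
from itertools import product

SETUPS = ("criu-inside-singularity", "singularity-inside-criu")
MODES = ("userns", "setuid")


def singularity_exec(mode, image, command):
    """Build a `singularity exec` command line for the given privilege mode."""
    argv = ["singularity", "exec"]
    if mode == "userns":
        # Assumption: the installed Singularity supports --userns.
        argv.append("--userns")
    return argv + [image] + command


def criu_dump(pid, images_dir):
    """Build a `criu dump` command line for the process tree rooted at pid."""
    return ["criu", "dump", "--tree", pid, "--images-dir", images_dir, "--shell-job"]


def plan(setup, mode, image="test.sif", images_dir="/tmp/ckpt"):
    """Return the list of command lines for one configuration (never executed here)."""
    workload = ["sleep", "1000"]  # placeholder workload
    if setup == "criu-inside-singularity":
        # Workload and CRIU both run inside the container; $! is the workload's PID.
        script = " ".join(workload) + " & " + " ".join(criu_dump("$!", images_dir))
        return [singularity_exec(mode, image, ["sh", "-c", script])]
    # CRIU runs on the host and dumps the whole Singularity process tree.
    return [
        singularity_exec(mode, image, workload),
        criu_dump("<singularity-pid>", images_dir),
    ]


def test_matrix():
    """Enumerate all four configurations: two setups times two privilege modes."""
    return {(s, m): plan(s, m) for s, m in product(SETUPS, MODES)}
```

Printing `test_matrix()` lists one entry per configuration, which makes it easy to record which of the four fails and at what step.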

Then, I'd look at what exactly fails. Since CRIU should support every application, it might be best to understand the issues of CRIU first, and only in a second step (after attempting to fix things in CRIU / the kernel checkpointing / restore feature) see if Singularity can be adapted to make checkpointing easier / integrate with CRIU.
In any case, the first sensible steps don't require any knowledge about Singularity, I'd say, only how to use and configure it. The significantly larger issue is to understand checkpoint / restore, and what is missing there.

Chiming in as another person who is super interested in seeing this feature exist! I took a stab at it a while ago but had to give up because the kernel my university's cluster runs on is currently too old for CRIU to work (as far as I can tell, at least).

I'll just chime in, it would be an awesome feature!

As I submitted the issue, I am also very interested in it. However, as with many other things, I do not have much free time for it...

I found the following sentence:

Support for Checkpoint Restart: Internal support for checkpoint-restarting for mobility of state

on a slide about Singularity 3.0 from @bauerm97 shown at the CernVM Users Workshop ( https://indico.cern.ch/event/608592/contributions/2830120/attachments/1592403/2520972/CernVM_Workshop.pdf ). What's that about?

@olifre it's on the roadmap. The basic idea is that one of the data objects within the SIF format could save the state of the container when it is paused. Then you can move your container to a new environment and Singularity will know how to start it again.

I previously expressed an intention to help out with CRIU checkpointing. Unfortunately, I was terribly overwhelmed with other work and could not get started on CRIU. Now I have more time and am eager to participate. I tried out some simple things, but CRIU simply fails to make a dump because of the complicated structure of namespaces.

I believe my work will be more efficient if somebody can give me some guidance how to proceed. Would anybody volunteer?

I believe my work will be more efficient if somebody can give me some guidance how to proceed. Would anybody volunteer?

@planetA I can give you a small intro.

For those looking to checkpoint-restart in Singularity containers, I got a minimal working example running with DMTCP. It successfully checkpoint-restarts a simple executable running inside a Singularity container on my local machine and on a CircleCI virtual machine. I'm still working on getting it to run properly in a cluster computing context. You can find the source here and pull it down for an interactive demonstration. Anyway, I hope this might be useful!

I'll chime in as well, it would be an awesome feature for many spot instances as well as preemption-based HPC systems.
