[Colloq] Thesis Proposal - Kapil Arya - Adaptive Checkpoint-Restart using Process Virtualization - Friday May 10th, 11:00am, 366 WVH
Jessica Biron
bironje at ccs.neu.edu
Tue May 7 15:13:24 EDT 2013
PhD Thesis Proposal
Kapil Arya
Date: Friday May 10, 2013
Time: 11:00 am
Location: 366 WVH
Title: Adaptive Checkpoint-Restart using Process Virtualization
Checkpoint-Restart is the ability to save a set of running processes to
a checkpoint image on disk, and to later restart them from disk. Apart
from the obvious use in recovering from a system failure, other use
cases include debugging and save/restore of a workspace. Previous
checkpoint-restart packages were inflexible: they tried to avoid the
issue of adapting to a changing external environment. Techniques
included modifying the kernel to reproduce the older execution
environment, disconnecting from all networks before a checkpoint, and
declining to support license and authentication servers.
Two other problems for previous packages were adapting to a new file
path on a new computer host, and handling of transient files (files
deleted from the filesystem). Further, most previous packages supported
distributed computations only in a specific context such as MPI (Message
Passing Interface) for HPC (High Performance Computing) applications.
In this work, I propose a plugin architecture that allows
checkpoint-restart systems to adapt to an altered environment during
restart. Two primary mechanisms used to support plugins are:
1. virtualization of kernel resource names and external agent names; and
2. saving and restoring the associated state at the time of
checkpoint and restart.
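
As a rough illustration of the two mechanisms above, the sketch below
shows a plugin that hands the application stable virtual names, translates
them to real names on use, and rebinds them to fresh real names at restart.
It is a minimal sketch only: the class ResourcePlugin, its methods, and the
allocate_real callback are hypothetical names chosen for this example and
are not taken from the proposal or from the DMTCP API.

```python
# Illustrative sketch only: hypothetical names, not the DMTCP plugin API.

class ResourcePlugin:
    """Virtualizes the names of one kind of resource (e.g. process IDs)."""

    def __init__(self):
        self.virt_to_real = {}    # translation table: virtual name -> real name
        self.next_virtual = 1000  # arbitrary starting point for virtual names

    def register(self, real_id):
        """Give the application a stable virtual name for a real resource."""
        virtual_id = self.next_virtual
        self.next_virtual += 1
        self.virt_to_real[virtual_id] = real_id
        return virtual_id

    def to_real(self, virtual_id):
        """Translate a virtual name before it reaches the kernel or an agent."""
        return self.virt_to_real[virtual_id]

    def checkpoint(self):
        """Save the state tied to this resource: the virtual names in use."""
        return {
            "virtual_ids": sorted(self.virt_to_real),
            "next_virtual": self.next_virtual,
        }

    def restart(self, saved, allocate_real):
        """Recreate each resource in the new environment and rebind its name.

        allocate_real is an assumed callback that returns a fresh real name.
        """
        self.next_virtual = saved["next_virtual"]
        self.virt_to_real = {v: allocate_real() for v in saved["virtual_ids"]}
```

In this sketch the application only ever holds the virtual name, so a
restart on a different host can assign a different real name without the
application noticing.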
Modularity is achieved by creating a separate plugin for each supported
external resource or external agent. Examples of resources supported by
plugins are TCP/IP networking, the InfiniBand network, the KVM virtual
machine, the X11 graphics system, and the SSH remote terminal protocol.
Further, the plugin mechanism is exposed to the end user, so that
user-specific requirements can be handled.
Related virtualization techniques are applied to solving the
double-swapping problem for virtual machines. Adaptive
checkpoint-restart is also applied in order to generalize the
old idea of algorithmic debugging.
Version 2.0 of DMTCP (Distributed MultiThreaded CheckPointing) is
planned as a plugin-based redesign of the older monolithic DMTCP
package.
Committee:
Gene Cooperman (Advisor)
Pete Manolios
Alan Mislove
Wil Robertson
Alex Garthwaite (External Member)