[Colloq] Thesis Proposal - Kapil Arya - Adaptive Checkpoint-Restart using Process Virtualization - Friday May 10th, 11:00am, 366 WVH

Jessica Biron bironje at ccs.neu.edu
Tue May 7 15:13:24 EDT 2013


PhD Thesis Proposal 

Kapil Arya 

Date: Friday May 10, 2013 
Time: 11:00 am 
Location: 366 WVH 


Title: Adaptive Checkpoint-Restart using Process Virtualization 


Checkpoint-Restart is the ability to save a set of running processes to 
a checkpoint image on disk, and to later restart them from that image. 
Apart from the obvious use in recovering from a system failure, other 
use cases include debugging and saving and restoring a workspace. 
Previous checkpoint-restart packages were inflexible. They tried to 
avoid the issue of adapting to a changing external environment. Some of 
the techniques used were modifying the kernel to reproduce the older 
execution environment, and disconnecting from all networks before a 
checkpoint, declining to support license and authentication servers. 
Two other problems for previous packages were adapting to a new file 
path on a new computer host, and handling transient files (files 
deleted from the filesystem). Further, most previous packages supported 
distributed computations only in a specific context such as MPI (Message 
Passing Interface) for HPC (High Performance Computing) applications. 

In this work, I propose a plugin architecture that allows 
checkpoint-restart systems to adapt to an altered environment during 
restart. Two primary mechanisms used to support plugins are: 
1. virtualization of kernel resource names and external agent names; and 
2. saving and restoring the associated state at the time of 
checkpoint and restart. 
Modularity is achieved by creating a separate plugin for each supported 
external resource or external agent. Examples of resources supported by 
plugins are TCP/IP networking, the InfiniBand network, the KVM virtual 
machine, the X11 graphics system, and the SSH remote terminal protocol. 
Further, the plugin mechanism is exposed to the end user, so that 
user-specific requirements can be handled. 
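
The two mechanisms listed above (name virtualization, plus saving and 
restoring state at checkpoint and restart) can be illustrated with a 
small sketch. This is not DMTCP's actual plugin API. The names 
VirtualIdTable, ResourcePlugin, on_checkpoint, and on_restart are 
hypothetical and chosen only for illustration. The sketch assumes the 
application only ever sees stable virtual ids, while each plugin keeps 
a private table mapping them to whatever real ids the current 
environment provides.

```python
import itertools


class VirtualIdTable:
    """Maps stable virtual ids to the real ids of the current environment."""

    def __init__(self, first_virtual_id=1000):
        self._virt_to_real = {}
        self._counter = itertools.count(first_virtual_id)

    def register(self, real_id):
        # Hand the application a virtual id that stays valid across restarts.
        virt = next(self._counter)
        self._virt_to_real[virt] = real_id
        return virt

    def to_real(self, virt):
        # Translate at the point where the real resource is actually used.
        return self._virt_to_real[virt]

    def rebind(self, virt, new_real_id):
        # Point an existing virtual id at a newly created real resource.
        if virt not in self._virt_to_real:
            raise KeyError(virt)
        self._virt_to_real[virt] = new_real_id

    def virtual_ids(self):
        return sorted(self._virt_to_real)


class ResourcePlugin:
    """Hypothetical plugin responsible for a single kind of resource."""

    def __init__(self):
        self.table = VirtualIdTable()

    def on_checkpoint(self):
        # Save only the virtual ids; the real ids are assumed to be
        # meaningless once the processes restart elsewhere.
        return {"virtual_ids": self.table.virtual_ids()}

    def on_restart(self, saved_state, allocate_real_id):
        # Recreate each resource in the new environment and rebind it.
        for virt in saved_state["virtual_ids"]:
            self.table.rebind(virt, allocate_real_id(virt))


def demo():
    plugin = ResourcePlugin()
    virt = plugin.table.register(4242)        # real id before checkpoint
    saved = plugin.on_checkpoint()
    plugin.on_restart(saved, lambda v: 977)   # real id after restart
    return virt, plugin.table.to_real(virt)
```

In this sketch the application would keep using the same virtual id 
before and after restart, and only the plugin's table changes. Giving 
each resource type its own plugin object mirrors the modularity 
described above, under the stated assumptions.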

Related virtualization techniques are applied to solving the 
double-swapping problem for virtual machines. Adaptive 
checkpoint-restart is also applied in order to generalize the 
old idea of algorithmic debugging. 

Version 2.0 of DMTCP (Distributed MultiThreaded CheckPointing) is 
planned as a plugin-based redesign of the older monolithic DMTCP 
package. 

Committee: 
Gene Cooperman (Advisor) 
Pete Manolios 
Alan Mislove 
Wil Robertson 
Alex Garthwaite (External Member) 

