Fault-Tolerant Metacomputing for Defense Applications
Agency / Branch:
DOD / DARPA
Parallel and distributed processing on "metacomputers" formed by networking scalable parallel computers with COTS workstations and PCs is clearly the wave of the future for high-performance computing. The reasons are obvious: cost and performance. The Department of Defense can benefit greatly, provided that the real-time and fault-tolerance requirements of key defense applications can be met. This project will design and assess a complete hierarchical environment supporting real-time, fault-tolerant applications on metacomputers. At the bottom, the environment will exploit basic system infrastructure for real-time support, basic group and communication services, information about hardware and software status, and fault recovery services. Above that will be a scalable, reliable global store suitable for use by the operating system and by applications to preserve critical state information. Exploiting the store will be tools for monitoring and control and for job/resource scheduling, both able to work cooperatively and intelligently to run real-time parallel applications adaptively, efficiently, and correctly, even in the presence of failures. Since other projects are developing the low-level infrastructure, the focus of this project will be on the high-level portions of the hierarchy, particularly the virtual store and monitoring tool, and on the design of a suitable API for development of fault-tolerant applications.
Small Business Information at Submission:
Principal Investigator: Robert Bjornson
One Century Tower, 265 Church Street, New Haven, CT 65107
Number of Employees: