Development, Design, and Implementation of Fault Management Technologies
NASA’s science program has well over 100 spacecraft in operation, formulation, or development, generating science data accessible to researchers everywhere. As science missions have increasingly complex goals—often on compressed timetables—and have more pressure to reduce operation costs, system autonomy must increase in response.
Fault management (FM) is a key component of system autonomy, serving to detect, interpret, and mitigate failures that threaten mission success. Robust FM must address the full range of hardware failures, and also must consider failure of sensors or the flow of sensor data, harmful or unexpected system interaction with the environment, and problems due to faults in software or incorrect control inputs—including failure of autonomy components themselves.
Despite lessons learned from past missions, spacecraft failures are still not uncommon, and reuse of FM approaches is limited, illustrating deficiencies in our approach to handling faults in all phases of the flight project lifecycle. The need exists at both extremes of space exploration: At one end, well-funded, resource-rich missions continue to experience difficulties due to system complexity, computing capability that fails to keep pace with expanding mission goals, and risk-averse design, ultimately curtailing mission capability and mission objectives when traditional fault management approaches cannot adequately ensure mission success. At the other end, very small and high-risk missions are flourishing because of advances in computing, microdevices, and low-cost access to space, but autonomy and fault management are increasingly seen as essential because of the high probability of faults and extreme resource limitations that make deliberative, ground-directed fault recovery impractical.
Although this subtopic is particularly interested in onboard FM capabilities (namely, onboard sensing approaches, computing, algorithms, and models to assess and maintain spacecraft health), the goal is to provide a system capability for management of future spacecraft. Offboard components such as modeling techniques and tools, development environments, and verification and validation (V&V) technologies are also relevant, provided they contribute to novel or capable onboard fault management.
Needed innovations in FM can be grouped into the following two categories:
- Fault management operations approaches: This category encompasses FM "in-the-loop," including algorithms, computing, state estimation/classification, machine learning, and model-based reasoning. Further research into fault detection and diagnosis, prognosis, fault recovery, and mitigation of unrecoverable faults is needed to realize greater system autonomy.
- Fault management design and implementation tools: Also sought are methods to formalize and optimize onboard FM, such as model-based system engineering (MBSE). New technologies to improve or guarantee fault coverage, manage and streamline complex FM, and improve system modeling and analysis significantly contribute to the quality of FM design and may prove decisive in trades of new versus traditional FM approaches. Automated test case development, false positive/false negative test tools, model V&V tools, and test coverage risk assessments are examples of contributing technologies.
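As an illustration of the kind of detection logic and false positive/false negative tooling described in the two categories above, the following is a minimal sketch and not a prescribed approach. It assumes a simple limit-check monitor with a persistence filter, evaluated against hand-labeled telemetry. The class `LimitMonitor`, the function `confusion_counts`, the limits, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LimitMonitor:
    """Hypothetical limit-check monitor with a persistence filter."""
    low: float
    high: float
    persistence: int  # consecutive out-of-limit samples needed to declare a fault

    def run(self, samples: Iterable[float]) -> List[bool]:
        """Return, for each sample, whether a fault is declared at that sample."""
        flags: List[bool] = []
        count = 0
        for value in samples:
            if value < self.low or value > self.high:
                count += 1
            else:
                count = 0  # any in-limit sample resets the persistence counter
            flags.append(count >= self.persistence)
        return flags


def confusion_counts(declared: List[bool], truth: List[bool]) -> Dict[str, int]:
    """Compare declared faults with labeled truth, sample by sample."""
    pairs = list(zip(declared, truth))
    return {
        "tp": sum(d and t for d, t in pairs),
        "fp": sum(d and not t for d, t in pairs),
        "fn": sum(t and not d for d, t in pairs),
        "tn": sum(not d and not t for d, t in pairs),
    }


# Illustrative telemetry (assumed values): one isolated noise spike at index 2,
# followed by a sustained over-limit condition at indices 4 through 6.
telemetry = [20.0, 21.0, 55.0, 20.5, 56.0, 57.0, 58.0, 21.0]
truth = [False, False, False, False, True, True, True, False]

monitor = LimitMonitor(low=0.0, high=50.0, persistence=2)
declared = monitor.run(telemetry)
counts = confusion_counts(declared, truth)
print(declared)
print(counts)
```

In this sketch, a persistence of 2 suppresses the isolated spike but delays declaration of the sustained fault by one sample, which shows up as one false negative. Setting the persistence to 1 would instead catch the fault immediately at the cost of one false positive. Tools of the kind sought here would help quantify such trades across a full FM design.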
Specific algorithms and sensor technologies are in scope, provided their impact is not limited to a particular subsystem, mission goal, or failure mechanism. Proposals for novel artificial-intelligence-inspired algorithms, machine learning, etc., should be submitted to this subtopic, and only this subtopic, if their design or application is specific to detection, classification, or mitigation of system faults and off-nominal system behavior. Although the core interests of this subtopic are spacecraft resilience and enabling spacecraft autonomy, closed-loop FM for other high-value systems such as launch vehicles and test stands is also in scope, particularly if the techniques can be easily adapted to spacecraft.
Related technologies that lack a primary focus on resolution of system faults, such as machine-learning approaches to spacecraft characterization or science data pre-processing, autonomy architectures, or generalized system modeling and design tools, should be directed to other subtopics such as S17.04, Application of Artificial Intelligence for Science Modeling and Instrumentation; or S17.02, Integrated Campaign and System Modeling.
Expected outcomes and objectives of this subtopic are to mature the practice of FM, leading to better estimation and control of FM complexity and development costs, more flexible and effective FM designs, and accelerated infusion into future missions through advanced tools and techniques. Specific objectives include the following:
- Increase spacecraft resilience against faults and failures.
- Increase spacecraft autonomy through greater onboard fault estimation and response capability.
- Increase collection and quality of science data through fault tolerance and mitigation of interruptions.
- Enable cost-effective FM design architectures and operations.
- Determine completeness and appropriateness of FM designs and implementations.
- Decrease the labor and time required to develop and test FM models and algorithms.
- Improve visualization of the full FM design across hardware, software, and operations procedures.
- Determine the extent of testing required, completeness of verification planned, and residual risk resulting from incomplete coverage.
- Increase data integrity between multidisciplinary tools.
- Standardize metrics and calculations across FM, systems engineering (SE), safety and mission assurance (S&MA), and operations disciplines.
- Bound and improve costs and implementation risks of FM while improving capability, such that benefits demonstrably outweigh the risks, leading to mission infusion.
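Several of the objectives above concern completeness of FM designs and residual risk from incomplete coverage. As a hedged illustration only, the sketch below assumes a design can be summarized as a table of failure modes with relative likelihood weights and a mapping from each failure mode to the monitors that detect it. The function `coverage_report`, the failure-mode and monitor names, and the weights are hypothetical and are not drawn from any mission.

```python
from typing import Dict, Set


def coverage_report(
    failure_modes: Dict[str, float],
    detections: Dict[str, Set[str]],
) -> dict:
    """Summarize which failure modes have at least one detecting monitor.

    failure_modes maps a failure-mode name to a relative likelihood weight.
    detections maps a failure-mode name to the set of monitors that detect it;
    a missing entry or an empty set means the mode is uncovered.
    """
    if not failure_modes:
        raise ValueError("at least one failure mode is required")
    uncovered = sorted(m for m in failure_modes if not detections.get(m))
    total_weight = sum(failure_modes.values())
    residual_weight = sum(failure_modes[m] for m in uncovered)
    return {
        "coverage_fraction": 1 - len(uncovered) / len(failure_modes),
        "uncovered": uncovered,
        "residual_weight_fraction": residual_weight / total_weight,
    }


# Illustrative inputs (assumed names and weights, not mission data).
failure_modes = {
    "battery_undervoltage": 4,
    "star_tracker_dropout": 3,
    "reaction_wheel_stall": 2,
    "flash_bit_flip": 1,
}
detections = {
    "battery_undervoltage": {"bus_voltage_limit"},
    "star_tracker_dropout": {"attitude_data_staleness"},
    "reaction_wheel_stall": set(),  # listed in the design but no monitor assigned
}

report = coverage_report(failure_modes, detections)
print(report)
```

Here half of the failure modes are covered, and the uncovered modes carry 30 percent of the total likelihood weight. A real assessment would need validated failure-mode data and a defensible risk model; the point of the sketch is only that coverage and residual risk can be stated as explicit, repeatable calculations.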
Expected TRL or TRL Range at completion of the Project: 3 to 4
Primary Technology Taxonomy:
- Level 1: 10 Autonomous Systems
- Level 2: 10.2 Reasoning and Acting
Desired Deliverables of Phase I and Phase II:
Desired Deliverables Description:
The aim of the Phase I project should be to demonstrate the technical feasibility of the proposed innovation and thereby bring the innovation closer to commercialization. Note, however, the research and development (R&D) undertaken in Phase I is intended to have high technical risk, and so it is expected that not all projects will achieve the desired technical outcomes.
The required deliverable at the end of an SBIR Phase I contract is a Final Report that summarizes the project’s technical accomplishments. As noted above, it is intended that proposed efforts conduct an initial proof of concept, after which successful efforts would be considered for follow-on funding by Science Mission Directorate (SMD) missions as risk-reduction and infusion activities. Research should be conducted to demonstrate technical feasibility and NASA relevance during Phase I and show a path toward a Phase II prototype demonstration.
The Phase I Final Report should thoroughly document the innovation, its status at the end of the effort, and as much objective evaluation of its strengths and weaknesses as is practical. The report should include a description of the approach along with foundational concepts and operating theory, mathematical basis, and requirements for application. Results should include strengths and weaknesses found and the measured performance in tests where possible.
Additional deliverables may significantly clarify the value and feasibility of the innovation. These deliverables should be planned to demonstrate retirement of development risk, increasing maturity, and targeted applications of particular interest. Although the wide range of innovations precludes a specific list, some possible deliverables are listed below:
- For innovations that are algorithmic in nature, this could include development code or prototype applications, demonstrations of capability, and results of algorithm stress testing.
- For innovations that are procedural in nature, this may include sample artifacts such as workflows, model prototypes and schema, functional diagrams, examples, or tutorial applications.
- Where a suitable test problem can be found, documentation of the test problem and a report on test results should illustrate the nature of the innovation in a quantifiable and reproducible way. Test reports should discuss maturation of the technology, implementation difficulties encountered and overcome, and results and interpretation.
Phase II proposals require at minimum a report describing the technical accomplishments of the Phase I award and how these results support the underlying commercial opportunity. Describing the commercial potential is best done through experiment: Ideally the Phase II report should describe results of a prototype implementation applied to a relevant problem, along with lessons learned and future work expected to adapt the technology to other applications. Further demonstration of commercial value and advantage of the technology can be accomplished through steps such as the following:
- Delivery of the technology in software form, as a reference application, or through provision of trial or evaluation materials to future customers.
- Technical manuals, such as functional descriptions, specifications, and user guides.
- Conference papers or other publications.
- Establishment of a preliminary performance model describing technology metrics and requirements.
Each of these measures represents a step taken to mature the technology and ease its reduction to practice. Although it is expected that further development and customization will continue beyond Phase II, ideally at the conclusion of Phase II a potential customer should have access to sufficient materials and evidence to make informed project decisions about technology suitability, benefits, and risks.
State of the Art and Critical Gaps:
Many recent SMD missions have encountered major cost overruns and schedule slips due to difficulty in implementing, testing, and verifying FM functions. These overruns are invariably caused by a lack of understanding of FM functions at early stages in mission development and by FM architectures that are not sufficiently transparent, verifiable, or flexible to provide needed isolation capability or coverage. In addition, a substantial fraction of SMD missions continue to experience failures with significant mission impact, highlighting the need for better FM understanding early in the design cycle, more comprehensive and more accurate FM techniques, and more operational flexibility in response to failures provided by better visibility into failures and system performance. Furthermore, SMD increasingly selects missions with significant operations challenges, setting expectations for FM to evolve into more capable, faster-reacting, and more reliable onboard systems.
The SBIR program is an appropriate venue because of the following factors:
- Traditional FM design has plateaued, and new technology is needed to address emerging challenges. There is a clear need for collaboration and incorporation of research from outside the spaceflight community, as fielded FM technology is well behind the state of the art and failing to keep pace with desired performance and capability.
- The need for new FM approaches spans a wide range of missions, from improving operations for relatively simple orbiters to enabling entirely new concepts in challenging environments. Development of new FM technologies by SMD missions themselves is likely to produce point solutions with little opportunity for reuse and will be inefficient at best compared to a focused, disciplined research effort external to missions.
- SBIR level of effort is appropriately sized to perform intensive studies of new algorithms, new approaches, and new tools. The approach of this subtopic is to seek the right balance between sufficient reliability and cost appropriate to each mission type and associated risk posture. This is best achieved with small and targeted investigations, enabled by captured data and lessons learned from past or current missions, or through examination of knowledge capture and models of missions in formulation. Following this initial proof of concept, successful technology development efforts under this subtopic would be considered for follow-on funding by SMD missions as risk-reduction and infusion activities.
Relevance / Science Traceability:
FM technologies are applicable to all SMD missions, albeit with different emphases. Medium-to-large missions have very low tolerance for risk of mission failure, leading to a need for sophisticated and comprehensive FM. Small missions, on the other hand, have a higher tolerance for risks to mission success but must be highly efficient, and are increasingly adopting autonomy and FM as a risk mitigation strategy.
A few examples are provided below, although these may be generalized to a broad class of missions:
- Lunar Flashlight (currently in assembly, test, and launch operations (ATLO), as an example of many similar future missions): Enable very low cost operations and high science return from a 6U CubeSat through onboard error detection and mitigation, streamlining mission operations. Provide autonomous resilience to onboard errors and disturbances that interrupt or interfere with science observations.
- Europa Lander: Provide onboard capability to detect and correct radiation-induced execution errors. Provide reliable reasoning capability to restart observations after interruptions without requiring ground in the loop. Provide MBSE tools to model and analyze FM capabilities in support of design trades and coordinated development with flight software. Maximize science data collection during an expected short mission lifetime due to environmental challenges.
- Rovers and rotorcraft (Mars Sample Return, Dragonfly, future Mars rotorcraft): Provide onboard capability for systems checkout, enabling lengthy drives/flights between Earth contacts and mobility after environmentally induced anomalies (e.g., unexpected terrain interaction). Improve reliability of complex activities (e.g., navigation to features, drilling and sample capture, capsule pickup, and remote launch). Ensure safety of open-loop control or enable closed-loop control to prevent or mitigate failures.
- Search for extrasolar planets (observation): Provide sufficient system reliability through onboard detection, reasoning, and response to enable long-period, stable observations. Provide onboard or on-ground analysis capabilities to predict system response and optimize the observation schedule. Enable reliable operations while out of direct contact (e.g., deliberately occluded from Earth to reduce photon, thermal, and radio-frequency background).
References:
- NASA's approach to FM and the various needs are summarized in the NASA FM Handbook: https://www.nasa.gov/pdf/636372main_NASA-HDBK-1002_Draft.pdf
- Additional information is included in the talks presented at the 2012 FM Workshop:
- Another resource is the NASA Technical Memorandum "Introduction to System Health Engineering and Management for Aerospace (ISHEM),"
- This is greatly expanded on in the following publication: Johnson, S. (ed): System Health Management with Aerospace Applications, Wiley, 2011, https://www.wiley.com/en-us/System+Health+Management+with+Aerospace+Applications-p-9781119998730
- FM technologies are strongly associated with autonomous systems as a key component of situational awareness and system resilience. A useful overview was presented at the 2018 SMD Autonomy Workshop, archiving a number of talks on mission challenges and design concepts: https://science.nasa.gov/technology/2018-autonomy-workshop