Description: OBJECTIVE: As a first objective, conduct basic and applied research surrounding new technologies to computer-generate completely synthetic, complex, medical text narratives for subsequent use in clinical informatics research and healthcare information technology feasibility studies. As a second objective, conduct basic and applied research surrounding new technologies to computer-generate completely synthetic medical images for subsequent use in clinical informatics research and healthcare information technology feasibility studies. If the research is successful, computer-generated synthetic medical text and images could then be made available to the government and/or other private researchers through commercial or open source licensing agreements.

DESCRIPTION: What is the problem? Medical research studies, particularly those involving medical informatics or health care information technology evaluations, are hampered by the lack of available, timely, high-quality structured, unstructured, and imaging data. Similar research studies conducted by academia and private industry labs face the same challenge. Acquiring such data from Military Health System Automated Systems of Record, Commercial Electronic Health Record (EHR), or Clinical/Business Intelligence systems is a time-consuming process. It typically requires: (1) obtaining permission from research subjects, (2) submitting detailed Human Subjects Research Protection protocols, (3) obtaining First and Second Level Institutional Review Board (IRB) approval, (4) obtaining Data Use Agreements (DUA), (5) obtaining Privacy Board approval, and (6) hiring experts to pull and de-identify or anonymize electronic health records data. This process can typically take up to a year to complete. Even if real data became available in a timely fashion, there is always the chance that it would not be completely de-identified or anonymized.
This is particularly true in the case of complex patient histories containing medical text narratives that may mention familial relationships or other unique facts that could be pieced together to identify the patient. De-identification and/or anonymization is typically conducted through computer algorithms that are not 100 percent trustworthy. Even in the case of additional human review after computer-assisted de-identification or anonymization, complete de-identification and/or anonymization cannot be guaranteed. Human review is typically limited by the resources available to read large numbers of patient histories, and is generally not available for research data sets involving thousands of records. These constraints pose additional challenges and risks for protecting patient privacy as required by HIPAA and Human Subjects Review. Data breaches can lead to patient medical and financial identity theft, public relations concerns, and subsequent class action lawsuits. Obtaining imaging data is also difficult. As one researcher indicates, "The collection of medical image data for research can be an expensive time consuming task. Positron emission tomography (PET), x-ray computed tomography (CT), and magnetic resonance imaging (MRI) systems can easily cost over a million dollars. They may require dedicated staff, maintenance contracts, and access to expensive supporting equipment such as a cyclotron. In addition, collection of data for large studies may take months. The process is complicated by equipment schedules, organization of volunteers/subjects, use of potentially harmful electromagnetic radiation, radiopharmaceuticals, and contrast agents, as well as patient privacy rights. These difficulties limit the availability of clinical data, especially for smaller academic research programs.
Creating software models of the human anatomy and imaging systems, and modeling the medical physics of the imaging acquisition process can provide a means to generate realistic synthetic data sets. In many cases synthetic data sets can be used, reducing the time and cost of collecting real images, and making data sets available to institutions without clinical imaging systems." In the envisioned ideal state, medical research studies could be expedited if synthetic medical text narrative data and images could be computer-generated. Having synthetic data would eliminate the need for IRB and DUA approvals because the data would be completely made up and have absolutely no ties to any real patient records. Furthermore, there would be no chance of HIPAA violations, patient medical or financial identity theft, public relations concerns, or class action lawsuits.

Why is it hard? Automatically generating synthetic health data is not trivial. Healthcare data is complex, and synthetically generating its different types may require different technologies arising from artificial intelligence research.
For example, semi-structured or free text clinical narratives can contain a clinician's free text or semi-structured assessment of the patient's history, probable diagnosis, and recommended procedures: radiology text reports; pathology text reports; operation text reports; discharge narrative summaries; and medical boards and disability profiles. Such unstructured and semi-structured data could contain: complex demographic information regarding the patient and those with whom he/she interacts; multiple diagnoses or problems and their progression over time; patient and family proteomic and genomic history; patient exposures to major trauma and related procedures; patient environmental or toxin exposures; patient nutrition, exercise, and sleep data; and key social and life events over time that impact health status. Imaging data is also complex and may include radiographs, MRI studies, CT scans, cardiology, nuclear medicine, and ultrasound studies, to name a few. In order to use such synthetically generated data in subsequent clinical research studies or healthcare IT proof of concept evaluations, it must be valid, reliable, complete, timely, and clinically relevant over the life of a patient. Determining whether the synthetically generated data is of sufficient quality for use in subsequent clinical research requires extensive human testing and review. In lieu of obtaining such real data for validation of synthetic data, the synthetically generated computer data would be validated solely by human subject matter expertise and knowledge of the real data. The research use case for the data would dictate what quality of synthetic data would be acceptable for use. For example, basic healthcare IT system functional testing and feasibility studies may be able to be validated with lower-quality synthetic data than that required for validating population health studies, which would require very high-quality synthetic data.
There would be additional challenges to ingesting such synthetic data into healthcare information technology development and testing environments, as government and commercial EHRs use different underlying logical and physical data models. Ideally, synthetically generated data should be in a format capable of ingestion into an electronic health record, and should consider the various standard component formats (C.XX) specified by the Healthcare Information Technology Standards Panel (HITSP), the Continuity of Care Record (CCR) and Clinical Document Architecture (CDA) standards, and newer efforts such as HL7 FHIR or the OMG Clinical Information Modeling Initiative backed by Mayo Clinic and Intermountain Healthcare. DICOM standards would apply to imaging studies. Research in this field is somewhat hampered by the limited number of qualified artificial intelligence researchers available, although there is promising research underway with respect to generating synthetic medical data.

How is it solved today? There are very few open-source or commercial synthetic medical data sets available today for use in clinical research. Available synthetic data sets are limited to very specific purposes. Most importantly, there are no known large, general purpose, complete sets of synthetic medical narrative text and imaging data that can be incorporated into EHRs for use in clinical informatics studies or healthcare information technology feasibility studies, although some research has started in this regard. NIH did release an initial Observational Medical Dataset Simulator (OSIM1) in 2009, which was used to generate datasets with millions of hypothetical patients with drug exposure, background conditions, and known adverse events for the purpose of benchmarking methods performance. Continued research has resulted in the development of a second-generation simulated dataset procedure, known as OSIM2.
OSIM2 represents an alternative design to accommodate additional complexities observed in real-world data, including advanced modeling of the correlations between drugs and conditions. OSIM2 allows for more direct comparisons between simulated data and real observational databases, and should enable greater methods evaluation by allowing assessment of how methods accommodate these complex interrelationships. Anna L. Buczak, Steven Babin, and Linda Moniz report in BioMed Central a novel methodology for "generating complete synthetic EMRs both for an outbreak illness of interest (tularemia) and for background records. The method developed has three major steps: 1) synthetic patient identity and basic information generation; 2) identification of care patterns that the synthetic patients would receive based on the information present in real EMR data for similar health problems; 3) adaptation of these care patterns to the synthetic patient population". The research did generate EMRs, including visit records, clinical activity, laboratory orders/results and radiology orders/results, but it was limited to only 203 synthetic tularemia outbreak patients. Lombardo and Moniz report on a similar method for generating synthetic data to study disease outbreaks. The Partners HealthCare i2b2 challenge has made available several public data sets for natural language processing studies. The U.S. Army Medical Research and Materiel Command, Telemedicine and Advanced Technology Research Center (TATRC), through its contracted partners, is generating some structured electronic health record data such as patient demographics; lab, pharmacy, and radiology orders and results; and disease and procedure codes.
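The three-step method quoted above (identity generation, care pattern identification, adaptation to the synthetic population) can be illustrated with a minimal sketch. This is not the published method: every name, vocabulary, and care pattern below is hypothetical and invented for illustration, and the care patterns are hand-written rather than derived from real EMR data, which is an assumption made here so that no real patient data is involved.

```python
import random
from dataclasses import dataclass, field

# Hypothetical vocabularies, invented for illustration only.
GIVEN_NAMES = ["Alex", "Jordan", "Morgan", "Riley"]
FAMILY_NAMES = ["Doe", "Roe", "Poe", "Loe"]

# Hypothetical care patterns keyed by a made-up problem label.
CARE_PATTERNS = {
    "fever_cough": ["outpatient visit", "chest radiograph order", "CBC lab order"],
    "ankle_injury": ["emergency visit", "ankle radiograph order"],
}


@dataclass
class SyntheticPatient:
    patient_id: str
    name: str
    age: int
    sex: str
    problem: str
    events: list = field(default_factory=list)  # (day offset, activity) pairs


def make_identity(rng, index):
    """Step 1: generate a synthetic identity and basic information."""
    return SyntheticPatient(
        patient_id=f"SYN-{index:05d}",
        name=f"{rng.choice(GIVEN_NAMES)} {rng.choice(FAMILY_NAMES)}",
        age=rng.randint(18, 90),
        sex=rng.choice(["female", "male"]),
        problem=rng.choice(sorted(CARE_PATTERNS)),
    )


def adapt_pattern(rng, patient):
    """Steps 2 and 3: look up a care pattern and adapt its timing to the patient."""
    day = 0
    for activity in CARE_PATTERNS[patient.problem]:
        patient.events.append((day, activity))
        day += rng.randint(0, 2)  # hypothetical spacing between activities, in days
    return patient


def generate_cohort(n, seed=0):
    """Generate n synthetic patients reproducibly from a seed."""
    rng = random.Random(seed)
    return [adapt_pattern(rng, make_identity(rng, i)) for i in range(n)]
```

Seeding the generator makes a cohort reproducible, which may help when the same synthetic data set must be regenerated for repeated feasibility tests.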
The TATRC research effort is also auto-generating very simple patient histories consisting of several lines of text noting that a patient may have presented for a clinical encounter or admission or has a history of a particular disease or procedure, but these short narratives are insufficient for natural language processing studies and other clinical informatics studies. The government is aware of ongoing efforts to synthetically generate business intelligence stories and sports stories. Other research efforts are underway in the Netherlands to create compelling, entertaining, and believable stories by focusing on plot creation, discourse generation and spoken language presentation. There are numerous web sites emerging which can generate children's stories. The underlying technologies might be applied to generating free or semi-structured medical text. The exact technologies employed are unknown and may be proprietary but most likely center around use of Latent Semantic Analysis to generate synthetic text narrative data. Some research is underway to apply computer graphics technologies to create entertainment movies and even virtual worlds, which might now be applied to generating synthetic medical images. Some research involving synthetic medical image generation is underway at academic institutions and companies in the Rochester, NY area, which is a major worldwide center of imaging excellence. The exact techniques employed by these efforts are unknown, although some research may be building upon the generation of synthetic data involving earth geographic sensor data.

Proposed Solution Summary: The need for synthetic data for use in clinical research and healthcare IT feasibility studies is well documented. This SBIR topic is intended to build upon the aforementioned research efforts to generate complex, story-like, unstructured or semi-structured electronic healthcare data for use in medical research.
In addition, the research will develop toolsets to generate synthetic digital images (X-ray, CT, MRI, ultrasound, pathology, and dermatology) using computer-generated imagery. The government is interested in evaluating innovative proposals outlining various approaches to generating such synthetic imaging and complex text narratives, which either build upon existing approaches or are entirely new approaches. Such synthetic data must be generated in a way that does not rely on any form of past data that was real, de-identified, or anonymized patient medical data, and should be derived "ab initio". The research should also include novel methods to compare the validity of the synthetic data to real data, for use in clinical informatics research and health information technology feasibility studies.

Boundaries to Consider: The quality of the synthetic data to be generated will be dictated by the current clinical informatics research and healthcare information technology feasibility studies underway, some of which are known, and some of which will be determined by higher authority closer to award. The government will negotiate this aspect with the vendor. Synthetically generated data should be in a format capable of ingestion into an electronic health record, and should consider the various standard component formats (C.XX) specified by the HHS Office of the National Coordinator for Healthcare Information Technology, the past work of the Healthcare Information Technology Standards Panel (HITSP), the Continuity of Care Record (CCR) and Clinical Document Architecture (CDA) standards, and the associated HL7 FHIR and OMG Clinical Information Modeling Initiative efforts. DICOM standards apply to imaging studies.
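As one illustration of the ingestion format requirement above, a synthetic patient could be serialized as a FHIR-style Patient resource in JSON. The sketch below is a minimal example under stated assumptions: the function name and sample values are hypothetical, only a handful of common Patient fields are populated, and the output is not validated against any FHIR profile or any particular EHR.

```python
import json


def to_fhir_patient(patient_id, family, given, gender, birth_date):
    """Build a minimal FHIR-style Patient resource as a plain dict.

    Hypothetical helper for illustration; a real exporter would cover
    far more fields and validate against the target FHIR version.
    """
    return {
        "resourceType": "Patient",
        "id": patient_id,
        "name": [{"family": family, "given": [given]}],
        "gender": gender,         # administrative gender code, e.g. "female"
        "birthDate": birth_date,  # ISO 8601 date string
    }


# Entirely made-up sample values.
resource = to_fhir_patient("syn-00001", "Doe", "Alex", "female", "1980-04-12")
payload = json.dumps(resource, indent=2, sort_keys=True)
```

Keeping the generator's internal patient model separate from the export step would let the same synthetic record be written to other targets, such as CDA documents, without changing the generation logic.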
Once generated based on certain characteristics, the applied research should aim to find ways to effectively ingest such data into existing healthcare information technology evaluation platforms, such as the TATRC Early Stage Platform for subsequent use in healthcare IT prototype and risk reduction activities, under an appropriate licensing agreement. As a matter of background, the TATRC Early Stage Platform provides a virtualized development and test environment containing current DOD electronic health record components (AHLTA, CHCS and Essentris in the future), which can then be used to support third party application development and other government funded clinical informatics research projects awarded to other government research recipients.

PHASE I: In Phase I, the offeror will outline the various technical approaches which exist for developing high quality synthetic medical images and complex narrative texts, which can then be used in clinical informatics research and health information technology studies. During Phase I, the vendor will work with the Government to understand the various clinical informatics and health IT research underway, for which the synthetic data is to be used. This will guide further decisions regarding the quality of the data which is to be generated. Phase I will be largely centered on whether it is even technically feasible to generate the complex narrative medical text necessary for use in clinical informatics research given the current state of artificial intelligence research. At the end of Phase I, the government will need to get a sense of whether it should even proceed with Phase II. Therefore, it is important that the offeror provide a final report that presents the relative advantages, disadvantages, and tradeoffs of each technical approach for generating synthetic data, and how each approach builds upon past knowledge concerning synthetic generation of data, and addresses current gaps.
The offeror may propose how it would improve upon existing approaches, or develop entirely new approaches to generate synthetic medical images and complex narrative medical text. It is also important to note that Phase I will extend beyond this tradeoff analysis. The offeror will also work with the government to develop use cases and configuration parameters surrounding the generation of synthetic data, which can demonstrate the feasibility of commercializing such technology and/or applying it in military medical research settings. As part of Phase I, the offeror will develop evaluation criteria to judge the quality of the imaging and narrative free or semi-structured text data for use in research, based on the use cases envisioned. Research would also center around gaining an understanding of the complexities of generating such data for potential ingestion into the U.S. Army Telemedicine and Advanced Technology Research Center Early Stage Platform, or other similar government lab platform, which will include the DOD Electronic Health Record (currently AHLTA, CHCS, and Essentris), as well as commercial and open source Electronic Health Records that the government may be considering for acquisition.

PHASE II: The Phase II effort is dedicated toward creating a new toolset, or configuring an existing toolset, to generate approximately 10,000 medical images and complex medical narrative texts that can be ingested into the TATRC Early Stage Platform (ESP), or otherwise be made available to TATRC-funded research partners. The vendor should propose whether it will make these data sets available under an open source or commercial licensing arrangement, and at what cost. Phase II will include a qualitative and quantitative comparative analysis of the quality of the synthetic data set as judged by government subject matter experts.
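One possible quantitative ingredient of such a comparative analysis is a distance between the distribution of a categorical field (for example, diagnosis codes) in the synthetic data and a reference distribution. The sketch below uses total variation distance; the choice of metric, the function names, and the sample values are all assumptions for illustration, and the reference distribution is presumed to come from subject matter experts or published statistics rather than from patient records. It would complement, not replace, expert review.

```python
from collections import Counter


def category_frequencies(values):
    """Return the relative frequency of each category in a non-empty sequence."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("values must not be empty")
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Total variation distance between two categorical distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Made-up category labels and a hypothetical expert-supplied reference.
synthetic_codes = ["A", "A", "B", "C"]
reference = {"A": 0.5, "B": 0.25, "C": 0.25}
distance = total_variation_distance(category_frequencies(synthetic_codes), reference)
```

An acceptance threshold for the distance would depend on the research use case, consistent with the expectation that functional testing tolerates lower-quality data than population health studies.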
During Phase II, TATRC will work with the awardees to introduce their technology to particular functional sponsors who might make use of the data under the proposed open source or commercial license. This would provide a potential technology transition route in Phase III into a military medical acquisition office, perhaps using a Military Health System or DOD-wide enterprise licensing agreement, but it is not guaranteed that this would occur. For those vendors considering releasing the toolsets or synthetically generated data sets to the open source community, TATRC will introduce the awardees to OSEHRA.

PHASE III DUAL USE APPLICATIONS: Phase III efforts would be aimed at bringing the development of synthetic medical images and narrative medical texts to a state where the synthetic data represents the equivalent of real data and can be used reliably to conduct medical informatics research. At the end of Phase III, the SBIR recipient may be able to continue to license the tool sets and generated synthetic data to JPC-1, TATRC, or other military medical customers and TATRC-specified partners for a particular time period. Outside of the military medical world, the offeror might be able to license the data to commercial vendors of electronic health records, health data warehouses, or health information exchanges for use in furthering development of these technologies. As one example, the Veterans Administration and/or HHS might also be interested in using such synthetic medical data sets in their research programs. Offerors should note that SBIR Phase III refers to work that derives from, extends, or logically concludes effort(s) performed under prior SBIR funding agreements, but is funded by sources other than the SBIR Program. Phase III work is typically oriented towards technology transition to Acquisition Programs of Record and/or commercialization of SBIR research or technology.
In Phase III, the small business is expected to obtain funding from non-SBIR government sources and/or the private sector to develop or transition the prototype into a viable product or service for sale in the military or private sector markets.