This topic is eligible for the DARPA Direct to PHASE II Pilot Program. Please see Section 7.0 of the DARPA Instructions for additional information. To be eligible, you must submit documentation demonstrating that Phase I feasibility (as described in PHASE I below) has been met. Offerors must choose between submitting a PHASE I proposal OR a Direct to PHASE II proposal, and may not submit both for the same topic.

OBJECTIVE: Investigate the national security threat posed by public data available either for purchase or through open sources. Based on principles of data science, develop tools to characterize and assess the nature, persistence, and quality of the data. Develop tools for the rapid anonymization and de-anonymization of data sources. Develop a framework and tools to measure the national security impact of public data and to defend against the malicious use of public data against national interests.

DESCRIPTION: The vulnerabilities to individuals from a data compromise are well known and are now documented as "identity theft." Stories published regularly in the news and in research journals document the loss of personally identifiable information by corporations and governments around the world. Current trends in social media and commerce, with voluntary disclosure of personal information, create other potential vulnerabilities for individuals participating heavily in the digital world. The Netflix Challenge in 2009 was launched with the goal of creating better customer pick prediction algorithms for the movie service. An unintended consequence of the Netflix Challenge was the discovery that it was possible to de-anonymize the entire contest data set with very little additional data. This de-anonymization led to a federal lawsuit and the cancellation of the sequel challenge. The purpose of this topic is to understand the national-level vulnerabilities that may be exploited through the use of public data available in the open or for purchase.
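To illustrate how little auxiliary data a de-anonymization of the kind described above can require, the following is a minimal Python sketch of a linkage attack on a toy ratings release. All records, names, and the `link` function are hypothetical, and the matching rule (counting exact item/rating agreements) is a simplification assumed for illustration; it is not the method that was used against the contest data.

```python
# Toy linkage sketch: every record and name below is hypothetical.

# "Anonymized" release: pseudonymous ID -> {item: rating}
released = {
    "u1": {"A": 5, "B": 3, "C": 4, "D": 1},
    "u2": {"A": 5, "B": 2, "E": 4},
    "u3": {"C": 4, "D": 1, "F": 2},
}

# Auxiliary public data: named person -> a few ratings known from elsewhere
auxiliary = {
    "alice": {"A": 5, "B": 3},
    "bob": {"D": 1, "F": 2},
}


def link(aux_ratings, released_records):
    """Return the pseudonymous ID whose record agrees with the auxiliary
    ratings on the most (item, rating) pairs, or None if there is no
    agreement at all or the best score is tied."""
    scores = {
        uid: sum(1 for item, rating in aux_ratings.items()
                 if record.get(item) == rating)
        for uid, record in released_records.items()
    }
    best = max(scores.values())
    winners = [uid for uid, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


matches = {name: link(ratings, released) for name, ratings in auxiliary.items()}
print(matches)  # {'alice': 'u1', 'bob': 'u3'}
```

In this toy case, two known ratings per person are enough to single out one pseudonymous record each. Real data sets would call for approximate matching and a confidence threshold rather than exact agreement.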
Could a modestly funded group deliver nation-state type effects using only public data? The threat of active data spills and breaches of corporate and government information systems is being addressed by many private, commercial, and government organizations. The purpose of this research is to investigate data sources that are readily available for any individual to purchase, mine, and exploit. The marketing community uses large-scale data aggregators, big data analytics, and social science techniques to deliver highly targeted advertising campaigns. Does the availability of data for purchase or for free, advanced marketing techniques (e.g., collaborative filtering, computational advertising), and low-cost big data analytic capabilities (e.g., Amazon EC2) provide a determined adversary with the tools necessary to inflict nation-state level damage? To what extent could a non-state actor collect, process, and analyze a portfolio of purchased and open source data to reconstruct an organizational profile, fiscal vulnerabilities, location of physical assets, work force pattern-of-life, and other information, in order to construct a deliberate attack on a specific capability?

The goal of this topic is to develop tools to characterize and assess the nature, persistence, and quality of data. The tools should be based on principled scientific methods for sampling and relevant statistical methods for assessment. Also of interest are tools to characterize the quality of data for automated processing and analysis (i.e., a measure of how much manpower would be required to use a specific source). Additionally, the goal of this topic is to characterize the threat through the creation of tools, techniques, and methodologies to measure the vulnerabilities in a given set of public data. As an example, the ability to reconstruct the profile of an organization from many pieces of data using methods of low computational complexity might indicate a vulnerability.
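One way to make "measure the vulnerabilities in a given set of public data" concrete is to count how many records become unique once a few attributes are combined. The Python sketch below is offered only as an illustration: the field names, sample records, and the `uniqueness_risk` function are hypothetical, and the k-anonymity-style group-size metric is one assumed choice among many, not a metric specified by this topic.

```python
from collections import Counter


def uniqueness_risk(records, quasi_identifiers):
    """Group records by their values on the given quasi-identifier fields.

    Returns (k, fraction_unique), where k is the size of the smallest
    group and fraction_unique is the share of records that are alone in
    their group. A small k or a large fraction suggests higher risk.
    """
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    k = min(counts.values())
    unique = sum(1 for size in counts.values() if size == 1)
    return k, unique / len(records)


# Hypothetical sample drawn from a purchasable consumer data set
records = [
    {"zip": "22203", "birth_year": 1980, "employer": "X"},
    {"zip": "22203", "birth_year": 1980, "employer": "Y"},
    {"zip": "22203", "birth_year": 1975, "employer": "X"},
    {"zip": "20301", "birth_year": 1980, "employer": "X"},
    {"zip": "20301", "birth_year": 1990, "employer": "Y"},
]

for fields in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "employer"]):
    print(fields, uniqueness_risk(records, fields))
```

In this sample no record is unique on `zip` alone, but every record is unique once all three fields are combined. The grouping is a single linear pass over the data, which matches the concern above that low-complexity methods may be enough to expose a vulnerability.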
Also of interest to this topic is the development of sensors, tools, and techniques necessary to defend against the malicious use of data for purchase. Throughout the performance of this research (Phases I, II, and III), there will be no indefinite collection or storage of data sources containing personally identifiable information (PII). Develop a proof-of-concept system that can automatically sample data from numerous sources, characterize the data, and provide automatic feedback on the measurable risk inherent in various collections of data. Develop a methodology for risk assessment and mitigation through reallocation of resources.

PHASE I: Investigate the landscape of public data, both open and purchasable, across several domains (e.g., GIS, webpages, consumer data, social media) through statistical data characterization and assessment. Develop a set of risk factors for vulnerability, including the computational complexity of a compromise, and design a prototype tool set necessary to automatically measure the risk inherent in the data. Develop a plan for detailed implementation of methods in PHASE II and III, including a data privacy plan.

DIRECT TO PHASE II: Offerors interested in submitting a Direct to PHASE II proposal in response to this topic must provide documentation that substantiates that the scientific and technical merit and feasibility described in the PHASE I section of this topic have been met, and that describes the potential commercial applications. Documentation should include all relevant information including, but not limited to: technical reports, test data, prototype designs/models, and performance goals/results. Read and follow Section 7.0 of the DARPA Instructions.

PHASE II: Develop a proof-of-concept system that can automatically sample data from numerous sources, characterize the data, and provide automatic feedback on the measurable risk inherent in various collections of data.
Develop a methodology for risk assessment and mitigation through reallocation of resources.

PHASE III: DoD entities, including the Army, Navy, and Air Force, are interested in operational security and in keeping their plans and operations from being compromised through vulnerabilities in public data. In addition, closing any gaps in such vulnerabilities will minimize the attack surface, which is also of interest to commercial organizations. The goals for this Phase, aimed at developing capabilities for defensive countermeasures, are as follows. Deploy a tool into a near-real-time environment that continually monitors available open source data, measures vulnerabilities, and provides defensive countermeasures. Develop a series of capabilities relevant to both government and commercial organizations to defend against threats due to the proliferation of purchasable or public data sets.