
Synthetic User Personas (SUP)


OUSD (R&E) CRITICAL TECHNOLOGY AREA(S): Integrated Sensing and Cyber

OBJECTIVE: The objective of Synthetic User Personas (SUP) is to generate labeled, synthetic cyber data suitable for enabling machine learning algorithms that support holistic cyber defenses.

DESCRIPTION: Currently, little to no labeled, unified host and network data is available to the cyber research community to facilitate the development and testing of machine learning algorithms for cyber defenses. Both network data, which captures network connections and packet flows, and host or endpoint data, which captures the use of applications and other activities on a machine, are necessary to build comprehensive cyber defenses.

There are two categories of existing datasets. The first is anonymized data from networks with human users, such as that provided by the Los Alamos National Laboratory [1]. The second is synthetic data generated and collected from a cyber exercise [2]. The data generated using each approach has significant problems that prevent its use in developing and testing machine learning algorithms.

One strength of anonymized data is that none of the events are synthetic. The data represents the actual activity on the network from which it was collected. What anonymized data typically lacks, however, is any sense of ground truth. Anonymized data usually contains limited events, preventing more realistic enrichments and limiting the scope of detection algorithms that can be trained. Using anonymized data from a real network also raises the question of whether a malicious actor was active when the data was collected and, as a result, whether malicious activity is represented in the dataset. If there was a malicious actor, there is no reliable or practical way to identify the specific events that were produced by the actor’s activity. As a result, such datasets are not suitable for training machine learning algorithms.
Further, elements of anonymized datasets may not be consistent with one another, since there may be correlations in the actual collected data that are not recognized and preserved by the anonymization process.

Alternatively, synthetic data can be easily and automatically annotated with ground truth (e.g., by accurately identifying and labeling benign and malicious events). However, to date, synthetic datasets lack the realism required to fully support development, training, and testing of machine learning algorithms. Synthetic datasets also typically contain unwanted artifacts that reduce or eliminate their value (e.g., such artifacts introduce biases when training machine learning algorithms).

Unified endpoint and network data is rare because of several issues. First, the anonymization of endpoint data and of network data each presents unique challenges. Notably, the limitless variations of potential endpoint data make anonymizing it impossible for arbitrary use cases. Preserving correlations among data elements in both types of data is also incredibly challenging, and again impossible for generalized cases. Second, the collection of endpoint data for research purposes typically requires Institutional Review Board approval. Third, the configuration management of, and policies governing, the endpoints may prohibit the deployment of a collection agent. Finally, most commercial agents do not make collected endpoint telemetry available for local analysis, instead sending it for centralized (e.g., cloud) processing.

SUP will implement synthetic agents designed to generate user activity without creating spurious network or host artifacts. SUP will not create a self-hosted agent that generates activity and filters out its own events from the event stream.
Rather, SUP will passively and remotely interpret data (e.g., from a computer screen) to understand the machine state, and then interact with the machine using external input sources (e.g., keyboard and mouse), thus emulating human users. All of the generation activity is “off box” so that no generation artifacts contaminate the collected data. This is a key factor in ensuring that the collected data is free of any spurious artifacts that may incorrectly bias machine learning algorithms trained on the synthetic data.

The “off box” synthetic agents implemented by SUP will be capable of scaling to at least five hundred (500) hosts within an enterprise test network. Additionally, SUP will provide the ability to generate and record user activity and associated data without the addition of software on the subject hosts and without relying on remote logins to the subject hosts. Ideally, the lightweight “off box” synthetic agents will be built with a language that natively supports concurrency, enabling straightforward scaling well beyond the 500-host requirement.

SUP will be able to respond correctly and continue proper operation after unexpected pop-ups and other operating system notifications occur. This will be implemented without reliance on timeouts or waiting periods to avoid unknown dialogs, and SUP will not make any assumptions as to when dialogs may or may not appear. SUP will continue to operate properly when the screen resolution changes unexpectedly (i.e., dependencies on image matching should work at any resolution without the need for code changes or a collection of images at every possible resolution).

Within an enterprise environment, there are typically many different types of employees and departments that need to be protected, each of which may represent different types of user behavior, with communications closely matching organizational groups and software use differing as well.
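As an illustration only, the off-box requirements above (observe the screen, act through external input, react to dialogs without timeouts, tolerate resolution changes, and scale to 500 hosts) can be sketched as a small simulation. This is not a SUP design: `SimulatedHost`, `to_pixels`, `run_agent`, `run_fleet`, the normalized button positions, and the dialog and resolution probabilities are all hypothetical, and Python's `asyncio` stands in for whatever concurrency-oriented language a proposer might choose.

```python
import asyncio
import random
from dataclasses import dataclass

# Hypothetical UI targets as fractions of the screen, so no per-resolution assets are needed.
DISMISS_BUTTON = (0.5, 0.6)
TASK_BUTTON = (0.25, 0.25)
RESOLUTIONS = [(1280, 720), (1920, 1080), (2560, 1440)]


def to_pixels(norm, size):
    """Map a normalized (x, y) target onto the current screen size."""
    return round(norm[0] * size[0]), round(norm[1] * size[1])


@dataclass
class SimulatedHost:
    """Stand-in for a subject machine reachable only through its screen and input devices."""
    name: str
    size: tuple = (1920, 1080)
    dialog_open: bool = False
    task_clicks: int = 0

    def capture_screen(self):
        # A real agent would interpret a video-capture feed; the stub exposes state directly.
        return {"size": self.size, "dialog": self.dialog_open}

    def click(self, x, y):
        # A real agent would drive external keyboard and mouse hardware.
        if self.dialog_open:
            if (x, y) == to_pixels(DISMISS_BUTTON, self.size):
                self.dialog_open = False
        elif (x, y) == to_pixels(TASK_BUTTON, self.size):
            self.task_clicks += 1

    def perturb(self, rng):
        # The environment may raise a dialog or change resolution at any moment.
        if rng.random() < 0.2:
            self.dialog_open = True
        if rng.random() < 0.1:
            self.size = rng.choice(RESOLUTIONS)


async def run_agent(host, steps, rng):
    """Observe-then-act loop: react to what is on screen, never wait out a dialog."""
    completed = 0
    while completed < steps:
        frame = host.capture_screen()
        if frame["dialog"]:
            host.click(*to_pixels(DISMISS_BUTTON, frame["size"]))
        else:
            host.click(*to_pixels(TASK_BUTTON, frame["size"]))
            completed += 1
        host.perturb(rng)
        await asyncio.sleep(0)  # yield so other agents can run
    return completed


async def run_fleet(n_hosts=500, steps=5, seed=0):
    """Drive one agent per host concurrently from a single off-box controller."""
    rng = random.Random(seed)
    hosts = [SimulatedHost(f"host-{i:03d}") for i in range(n_hosts)]
    results = await asyncio.gather(*(run_agent(h, steps, rng) for h in hosts))
    return hosts, results


if __name__ == "__main__":
    hosts, results = asyncio.run(run_fleet())
    print(len(hosts), "hosts,", sum(results), "task actions completed")
```

The agent inspects the screen on every iteration and decides what to do from what it sees, so a dialog is handled whenever it appears and not after a fixed wait. Because targets are stored as screen fractions, the same code keeps working when the simulated resolution changes. A real system would need actual screen interpretation and hardware input injection, which this sketch deliberately omits.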
SUP will enable the emulation of multiple user profiles to provide a variety of realistic user behaviors across the environment. User modeling efforts may span multiple levels of complexity. For example, activities may be performed at random, quickly changing from web browsing tasks to e-mail. Alternatively, a specific workflow may be defined, providing scripts or playbooks from which to draw actions. Finally, complex emergent behaviors may be built on models of real human behavior.

Previous work has explored many approaches to user behavior modeling. Amirkhanyan et al. [3] looked at modeling user behavior using graphical methods called user behavior state graphs. Drawing on human factors research, Garg et al. [4] incorporated features such as nervousness, typing speed, and mouse movement behaviors into user behavior patterns that could then be replicated in a testbed environment. Blythe et al. [5] explored using the Belief-Desire-Intention model for creating intelligent agents that were capable of using planning and reaction to achieve preset goals. These methods were demonstrated as part of the Deter Agents Simulating Human-Behavior module that is a part of DeterLab [6]. Additionally, the GHOSTS-SPECTRE project has demonstrated the use of machine learning methods to drive web browsing behavior in support of data generation while modeling changing user preferences [7].

Traditionally, user data generation has used an agent “on box.” The agent creates artifacts as a result of its activity, and those artifacts must be filtered out if possible; otherwise, they may introduce biases into any learned algorithm.

PHASE I: This topic is soliciting Direct to Phase II (DP2) proposals only.
Phase I feasibility will be demonstrated through evidence of: a completed feasibility study or a basic prototype system; definition and characterization of properties desirable for both Department of Defense (DoD) and civilian use; and comparisons with alternative state-of-the-art methodologies (competing approaches). Proposers interested in submitting a DP2 proposal must provide documentation to substantiate that the scientific and technical merit and feasibility described above have been met and describe the potential commercial applications. DP2 documentation should include:
• technical reports describing results and conclusions of existing work, particularly regarding the commercial opportunity or DoD insertion opportunity, risks/mitigations, and assessments;
• presentation materials and/or white papers;
• technical papers;
• test and measurement data;
• prototype designs/models;
• performance projections, goals, or results in different use cases; and,
• documentation of related topics such as how the proposed SUP solution can enable more realistic cyber training.

This collection of material will verify mastery of the required content for DP2 consideration. DP2 proposers must also demonstrate knowledge, skills, and ability in networking, computer science, mathematics, and software engineering. For detailed information on DP2 requirements and eligibility, please refer to the DoD BAA and the DARPA Instructions for this topic.

PHASE II: The goal of SUP is to generate realistic synthetic data that is free of artifacts and capable of scaling. An average cyber operator should not be able to determine that the data is synthetic by looking at the generated data, even when the operator has knowledge of typical human activity that was modeled when generating the synthetic data. The operator’s view of user behavior is limited to the event activity of the user; the operator will not have visibility of the actual content created by the user.
The SUP prototype should easily scale as a result of its architecture and implementation language. DP2 proposals should present systems that:
• generate realistic synthetic data without artifacts such that an average operator cannot determine that the event data is synthetic; and
• scale to at least five hundred (500) end user machines.

Phase II will culminate in a system demonstration using one or more compelling use cases consistent with commercial opportunities and/or insertion into a DARPA program.

The schedule of milestones and deliverables below is provided to establish expectations and desired results/end products for the Phase II effort. Deliverables must include:
• a software implementation of SUP for a virtualized test environment; and
• an example dataset generated by SUP suitable for machine learning research.

The Phase II Option period will further mature the technology for insertion into a larger DARPA Program, a DoD/Intelligence Community (IC) Acquisition Program, or another Federal agency, or for commercialization in the private sector.

Schedule/Milestones/Deliverables: Proposers will execute the research and development (R&D) plan as described in the proposal.
• Month 1: Phase II Kickoff briefing (with annotated slides) to the DARPA Program Manager (PM) including: any updates to the proposed plan and technical approach, risks/mitigations, schedule (inclusive of dependencies) with planned capability milestones and deliverables, proposed metrics, and plan for prototype demonstration/validation.
• Months 4, 7, 10: Quarterly technical progress reports detailing technical progress made, tasks accomplished, major risks/mitigations, a technical plan for the remainder of Phase II (while this will normally report progress against the plan detailed in the proposal or presented at the Kickoff briefing, it is understood that scientific discoveries, competition, and regulatory changes may all have impacts on the planned work and DARPA must be made aware of any revisions that result), planned activities, trip summaries, and any potential issues or problem areas that require the attention of the DARPA PM.
• Month 12: Interim technical progress briefing (with annotated slides) to the DARPA PM detailing progress made (including a quantitative assessment of capabilities developed to date), tasks accomplished, major risks/mitigations, planned activities, technical plan for the second half of Phase II, the demonstration/verification plan for the end of Phase II, trip summaries, and any potential issues or problem areas that require the attention of the DARPA PM.
• Months 15, 18, 21: Quarterly technical progress reports detailing technical progress made, tasks accomplished, major risks/mitigations, a technical plan for the remainder of Phase II (with necessary updates as in the parenthetical remark for Months 4, 7, and 10), planned activities, trip summaries, and any potential issues or problem areas that require the attention of the DARPA PM.
• Month 24 (Final Phase II Deliverables): Final technical progress briefing (with annotated slides) to the DARPA PM. Final architecture with documented details; a demonstration of the ability to generate artifact-free data at scale; documented application programming interfaces; and any other necessary documentation (including, at a minimum, user manuals, a detailed system design document, and the end-of-phase commercialization plan).
• Month 30 (Phase II Option period): Interim Option period technical progress briefing (with annotated slides) to the DARPA PM. Interim report of prototype performance against existing state-of-the-art technologies documenting key technical gaps towards productization.
• Month 36 (Phase II Option period): Final Option period technical progress briefing (with annotated slides) to the DARPA PM. Final Phase II Option period report of prototype performance against existing state-of-the-art technologies, including quantitative metrics for scalability, assessments of realism, and costs, risks, and schedule for implementation of the full prototype capability into a government-chosen test facility.

PHASE III DUAL USE APPLICATIONS: SUP has potential applicability across DoD, IC, U.S. Government (USG), and commercial entities. For DoD/IC/USG, SUP is extremely well-suited for large-scale cyber exercises, smaller-scale operator training, weapon system software testing, and automation of rote tasks. SUP has the same applicability in the commercial sector as in DoD/IC/USG.

The Phase III work will be oriented towards transition and commercialization of the developed SUP technologies. The proposer is required to obtain funding from either the private sector, a non-SBIR Government source, or both, to develop the prototype into a viable product or non-R&D service for sale in military or private sector markets. Phase III refers to work that derives from, extends, or completes an effort made under prior SBIR funding agreements, but is funded by sources other than the SBIR Program.

Primary SUP support will be to national efforts to explore application of artificial intelligence (AI) to improve generation of realistic user events captured during cyber-security testing on synthetic ranges. AI technologies will provide the foundation for developing sophisticated user behavior models that can be used in cyber range exercises.
In particular, it is important that these models are realistic and do not bias machine learning approaches because of predictable artifacts. Results of SUP are intended to improve the quality of cyber ranges used across academia, industry, and government.

REFERENCES:
1. Los Alamos National Laboratory. “Advanced Research in Cyber Systems Data Sets.” Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration, July 5, 2022.
2. DARPA (2020) Operationally Transparent Cyber Data Release [Data Set]. Available at:
3. Amirkhanyan, A., Sapegin, A., Gawron, M., Cheng, F., & Meinel, C. (2015, September). Simulation user behavior on a security testbed using user behavior states graph. In Proceedings of the 8th International Conference on Security of Information and Networks (pp. 217-223). Available at:
4. Garg, A., Vidyaraman, S., Upadhyaya, S., & Kwiat, K. (2006, April). USim: a user behavior simulation framework for training and testing IDSes in GUI based systems. In 39th Annual Simulation Symposium (ANSS'06) (pp. 8-pp). IEEE. Available at:
5. Blythe, J., Botello, A., Sutton, J., Mazzocco, D., Lin, J., Spraragen, M., & Zyda, M. (2011, August). Testing cyber security with simulated humans. In Twenty-Third IAAI Conference. Available at:
6. University of Southern California Information Sciences Institute (USC-ISI). “The cyber DEfense Technology Experimental Research (DETER) Lab Capabilities.” The DETER Project, USC-ISI, July 5, 2022.
7. Carnegie Mellon University Software Engineering Institute (2020) GHOSTS-SPECTRE [Source Code] Available at:

KEYWORDS: Machine Learning, Cyber, Artificial Intelligence, Automation, Data, Analytics