TECHNOLOGY AREAS: Information Systems
OBJECTIVE: The United States Air Force is looking for technological innovations based on machine learning techniques for Dynamically Evolving Malware Detection in Data and Network Streams.
DESCRIPTION: Polymorphic malware poses increasing challenges to effective signature-based antivirus protection. Antivirus defenses have managed to stay ahead in the virus-antivirus co-evolution race only because antivirus signature updates are specific and targeted, whereas most polymorphic malware variation is random and undirected. More powerful malware mutation strategies that use automated machine learning to adapt to signature updates in the wild are being examined. By tailoring their mutations to specific signature updates, such malware can reliably survive signature updates without re-propagating, posing potentially serious threats to existing network infrastructures. The relentless appearance of novel malicious and non-malicious executables can be conceptualized as a data stream in which each data point is an executable. New kinds of attacks and mutations constitute concept drift in such a stream. Current machine learning-based classification approaches that are being examined to detect concept drift in data streams show promise for detecting evolutionary malware. One open challenge in fully adapting these approaches to malware detection is the difficulty of reliably and efficiently extracting useful features from binary executables, which tends to be substantially more difficult than the feature-extraction problem for purely textual streams. A second challenge is concept evolution: the continuous appearance of novel classes (e.g., new types of malware) in the stream. Machine learning approaches that assume a fixed number of classes are therefore impractical for malware detection. To detect new malware as well as reactive and adaptive malware, dynamic mechanisms for novel class detection based on sophisticated machine learning algorithms are needed. Such mechanisms should account for both concept drift and concept evolution to reliably detect new and old malware variants in infinite-length binary data streams.
Furthermore, practical detection mechanisms should be time-constrained so that prompt action can be taken.
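The three ideas named above (feature extraction from binaries, concept drift, and concept evolution) can be illustrated with a minimal Python sketch. It is not the approach solicited by this topic, only one simple possibility under stated assumptions: features are byte n-gram frequencies, each known class is summarized by a single centroid that is updated with an exponential moving average to follow drift, and a novel class is suspected when several recent unclassifiable samples lie close to one another. All names (byte_ngram_features, StreamClassifier, radius, decay, min_outliers, buffer_size) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, deque


def byte_ngram_features(data: bytes, n: int = 2) -> dict:
    """Return the relative frequency of each byte n-gram in `data`."""
    total = len(data) - n + 1
    if total <= 0:
        return {}
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def distance(a: dict, b: dict) -> float:
    """Euclidean distance between two sparse feature vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


class StreamClassifier:
    """Toy nearest-centroid stream classifier (illustrative assumption only)."""

    def __init__(self, radius, decay=0.1, min_outliers=3, buffer_size=100):
        self.radius = radius              # max distance to accept a known class
        self.decay = decay                # weight of each new labeled sample
        self.min_outliers = min_outliers  # cohesive outliers needed to flag novelty
        self.centroids = {}               # label -> sparse feature vector
        # Bounded buffer so memory stays fixed on an unbounded stream.
        self.outliers = deque(maxlen=buffer_size)

    def learn(self, label, features):
        """Shift the class centroid toward a labeled sample (tracks concept drift)."""
        old = self.centroids.get(label)
        if old is None:
            self.centroids[label] = dict(features)
            return
        keys = set(old) | set(features)
        self.centroids[label] = {
            k: (1 - self.decay) * old.get(k, 0.0) + self.decay * features.get(k, 0.0)
            for k in keys
        }

    def classify(self, features):
        """Return the nearest known label, or None if no centroid is within radius."""
        best_label, best_dist = None, float("inf")
        for label, centroid in self.centroids.items():
            d = distance(features, centroid)
            if d < best_dist:
                best_label, best_dist = label, d
        if best_dist <= self.radius:
            return best_label
        self.outliers.append(features)  # keep for the novel class check
        return None

    def novel_class_suspected(self):
        """True when enough buffered outliers cluster around the latest one."""
        if not self.outliers:
            return False
        latest = self.outliers[-1]
        close = sum(1 for o in self.outliers if distance(o, latest) <= self.radius)
        return close >= self.min_outliers
```

In use, a caller would feed each incoming executable through byte_ngram_features, call classify, and call learn whenever a confirmed label becomes available. When novel_class_suspected returns True, the buffered outliers could be handed to an analyst or promoted to a new class. Every operation touches only the centroids and a bounded buffer, which keeps per-sample cost predictable, in line with the time constraint noted above. A realistic system would need far richer features and a statistically grounded novelty test.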
PHASE I: Conduct a preliminary investigation of machine learning-based malware detection with novel classes in a controlled environment, as well as novel class detection in an unconstrained textual environment (e.g., web blog content).
PHASE II: Develop a proof-of-concept demonstration of the technology in a real-world environment, with real-time applications.
PHASE III DUAL USE COMMERCIALIZATION:
Military Application: Tools developed from this research will be used for active defensive operations, such as robust malware detection and quick analysis of web content for malicious intent.
Commercial Application: Results of the research will be useful in commercial intrusion/malware detection systems and stream (e.g., text and binary) classification systems.