
Machine Learning Detection of Source Code Vulnerability


TECHNOLOGY AREA(S): Information Systems


Develop and demonstrate a software capability that uses machine-learning techniques to scan source code for its dependencies, trains cataloging algorithms on code dependencies and the detection of known vulnerabilities, and scales to support polyglot architectures.


Nearly every software library in the world depends on some other library, and identifying security vulnerabilities across the entire corpus of these dependencies is extremely challenging. As part of a Development, Security, and Operations (DevSecOps) process, this identification is typically accomplished using the following methods: (a) Static code analyzers. These can be useful but are technically challenging to implement in large and complex legacy environments. They typically require setting up a build environment for each version to build call and control flow graphs, and they are language-specific, so they do not work well when there are multiple versions of software using different dependency versions. (b) Dynamic code review. This is extremely costly to implement, as it requires a complete setup of an isolated environment, including all applications and databases a project interacts with. (c) Decompilation to perform static code analysis. This again depends on the software version and is specific to the way machine code is generated.

The above methods by themselves generate statistically significant numbers of false positives and false negatives. False positives come from the erroneous detection of vulnerabilities and require a human in the loop to discern signal from noise. False negatives come from dependent software that has been altered and therefore goes undetected (e.g., code copied, pasted, and changed from external libraries).
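The copy/paste/change case can be illustrated with a minimal sketch of token-based similarity. It assumes that renaming identifiers is the main alteration; the keyword list, the shingle size `k`, and all function names are hypothetical choices for illustration, not a prescribed method.

```python
import re

# Assumption: a tiny C-like keyword set; everything else that looks like a
# name is treated as an identifier and collapsed to a placeholder.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source: str) -> list[str]:
    """Tokenize and replace identifiers and numbers with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def shingles(tokens: list[str], k: int = 4) -> set[tuple[str, ...]]:
    """Return the set of all k-token windows."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the normalized k-token shingles of a and b."""
    sa, sb = shingles(normalize(a), k), shingles(normalize(b), k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


library_code = "int copy(char *dst, char *src) { while (*src) { *dst++ = *src++; } return 0; }"
renamed_copy = "int dup(char *out, char *in) { while (*in) { *out++ = *in++; } return 0; }"
unrelated = "void log(int level) { if (level) { return; } }"
```

Because identifiers are collapsed before comparison, `similarity(library_code, renamed_copy)` stays high even though every name was changed, while `similarity(library_code, unrelated)` stays low. Heavier edits (reordered statements, inserted code) would need more robust techniques than this sketch.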

Promising developments from commercial vendors, such as Synopsys Black Duck Hub, IBM AppScan, and Facebook's Infer, provide text mining services for project source trees and compare them against vulnerability databases. However, these tools are costly to use and require one's code to be packaged and uploaded to a third-party service.

Work produced in Phase II may become classified. Note: The prospective contractor(s) must be U.S. owned and operated with no foreign influence as defined by DoD 5220.22-M, National Industrial Security Program Operating Manual, unless acceptable mitigating procedures can be and have been implemented and approved by the Defense Counterintelligence and Security Agency (DCSA). The selected contractor and/or subcontractor must be able to acquire and maintain a Secret-level facility clearance and Personnel Security Clearances in order to perform on advanced phases of this project, as set forth by DCSA and NAVWAR, and to gain access to classified information pertaining to the national defense of the United States and its allies; this will be an inherent requirement. The selected company will be required to safeguard classified material in accordance with DoD 5220.22-M during the advanced phases of this contract.


Develop a concept design for a software utility that:

  • Performs text mining on source trees so that it (a) accurately identifies all declared and undeclared dependencies, and (b) does not require a setup of the build environment.
  • Trains algorithms to catalog multiple vulnerability databases, both public and internal to the Defense and Intelligence communities, to detect known vulnerabilities, and delineate recommended fixes for the software developer.
  • Trains algorithms to catalog the libraries that many projects depend upon (e.g., OpenSSL), mapping their correct version, identifying known vulnerabilities in that version, and reconciling against the current project so that scanning the entire corpus of external dependencies is an efficient and scalable process (note: these parameters must also be able to be tuned for each project).
  • Detects if code was extracted from external libraries and manipulated to make it look as if it was organically produced (presumably using the above cataloging features).
  • Scales to support polyglot architectures.
  • Performs the above services for every version in a code repository so that vulnerabilities across multiple versions can be comprehensively tracked.
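The first capability above, identifying dependencies by text mining without a build environment, can be sketched as follows. This is a minimal illustration that assumes dependencies are visible as import or include statements; the pattern table and the function names `scan_text` and `scan_tree` are hypothetical, and undeclared (copied-in) dependencies would require additional techniques.

```python
import re
from pathlib import Path

# Assumption: one simple pattern per file suffix. Real source trees need many
# more cases (build manifests, dynamic imports, vendored code, and so on).
C_INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)
PATTERNS = {
    ".py": re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][\w.]*)", re.MULTILINE),
    ".java": re.compile(
        r"^\s*import\s+(?:static\s+)?([\w.]+(?:\.\*)?)\s*;", re.MULTILINE
    ),
    ".c": C_INCLUDE,
    ".h": C_INCLUDE,
    ".cpp": C_INCLUDE,
}


def scan_text(suffix: str, text: str) -> set[str]:
    """Return the dependency names declared in one file's text."""
    pattern = PATTERNS.get(suffix)
    return set(pattern.findall(text)) if pattern else set()


def scan_tree(root: str) -> dict[str, set[str]]:
    """Walk a source tree and map each recognized file to its dependencies."""
    found = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in PATTERNS:
            text = path.read_text(encoding="utf-8", errors="ignore")
            deps = scan_text(path.suffix, text)
            if deps:
                found[str(path)] = deps
    return found
```

Calling `scan_tree` on a checked-out repository directory returns a per-file map of declared dependencies without compiling anything. Running it on each tagged version of a repository would give the per-version view described in the last bullet.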

The feasibility study must show that the software utility can easily integrate into existing Continuous Integration/Continuous Deployment (CI/CD) DevSecOps tools. Metrics for accuracy, scalability, and speed must also be provided. Develop integration plans for Phase II.
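One possible way to report accuracy is precision, recall, and F1 computed against a labeled ground-truth set of vulnerabilities. The sketch below assumes findings are identified by string IDs; the function name and the example IDs are hypothetical.

```python
def detection_metrics(reported: set[str], actual: set[str]) -> dict[str, float]:
    """Compare reported findings against ground truth.

    False positives are findings that are not real vulnerabilities;
    false negatives are real vulnerabilities that were not reported.
    """
    true_pos = len(reported & actual)
    false_pos = len(reported - actual)
    false_neg = len(actual - reported)
    precision = true_pos / (true_pos + false_pos) if reported else 0.0
    recall = true_pos / (true_pos + false_neg) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical finding IDs for illustration only.
example = detection_metrics(
    reported={"V-1", "V-2", "V-3", "V-4"},
    actual={"V-1", "V-2", "V-5"},
)
```

In this example two of four reported findings are real (precision 0.5) and two of three real vulnerabilities were found (recall of about 0.67).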

NOTE: Detailed knowledge of Navy data sources may not be necessary during Phase I if the performer can show the above. It is recommended to use publicly available open-source software repositories (e.g., the Linux kernel or the Chromium project) and to leverage, for example, the National Vulnerability Database or Common Vulnerabilities and Exposures databases.
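Reconciling detected dependency versions against a vulnerability catalog could look like the following sketch. It assumes purely numeric dotted versions and an in-memory catalog; the `Advisory` record, its fields, and the example entry are hypothetical and do not reflect the schema of any real database.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a dotted numeric version. Assumption: no letters or suffixes."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Advisory:
    advisory_id: str     # hypothetical identifier, not a real CVE
    package: str
    first_affected: str  # inclusive lower bound
    fixed_in: str        # exclusive upper bound, also the recommended fix

    def affects(self, package: str, version: str) -> bool:
        if package != self.package:
            return False
        low, high = parse_version(self.first_affected), parse_version(self.fixed_in)
        return low <= parse_version(version) < high


def find_advisories(
    dependencies: dict[str, str], catalog: list[Advisory]
) -> dict[str, list[str]]:
    """Map each affected dependency to its advisories and recommended fixes."""
    findings: dict[str, list[str]] = {}
    for package, version in dependencies.items():
        for adv in catalog:
            if adv.affects(package, version):
                findings.setdefault(package, []).append(
                    f"{adv.advisory_id}: upgrade to {adv.fixed_in} or later"
                )
    return findings


# Made-up catalog entry for illustration only.
EXAMPLE_CATALOG = [Advisory("EXAMPLE-0001", "examplelib", "1.0.0", "1.2.3")]
```

With this catalog, `find_advisories({"examplelib": "1.1.0", "otherlib": "2.0.0"}, EXAMPLE_CATALOG)` flags only `examplelib`. Real version schemes (letter suffixes, pre-release tags) would need a more careful parser than `parse_version`.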


Develop, demonstrate, validate, and mature the Phase I-developed concepts into prototype software. Work with the Government to establish metrics and acceptance testing for the bullets listed in Phase I.

  • Demonstrate that the cataloging of dependent software packages can scale to internal and external dependent software packages.
  • Demonstrate that the number of source vulnerability databases can be expanded to include internal and external sources.
  • Demonstrate that the service can scan for vulnerabilities in more than two languages, to include Java, C++, and Python.
  • Demonstrate that the service can ingest custom vulnerability information using a known specification (e.g., SCAP, CWE).
  • Provide interfaces to ingest, process, and validate a user's custom source code and custom security bug information.
  • Establish/document a lifecycle maintenance plan for the Navy.
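The ingest-and-validate interface for custom security bug information could be sketched as below. The JSON record layout and field names are assumptions made for illustration and are not a SCAP or CWE data format; the only convention taken from CWE is the `CWE-<number>` identifier form.

```python
import json
import re

CWE_ID_RE = re.compile(r"^CWE-\d+$")
# Hypothetical required fields for a custom vulnerability record.
REQUIRED_FIELDS = ("id", "cwe", "package", "description")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is accepted."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty field: {field}")
    cwe = record.get("cwe")
    if isinstance(cwe, str) and cwe.strip() and not CWE_ID_RE.match(cwe):
        errors.append(f"cwe must look like CWE-<number>: {cwe!r}")
    return errors


def ingest(json_text: str) -> tuple[list[dict], list[tuple[int, list[str]]]]:
    """Split a JSON array of records into accepted and rejected entries."""
    accepted, rejected = [], []
    for index, record in enumerate(json.loads(json_text)):
        errors = validate_record(record)
        if errors:
            rejected.append((index, errors))
        else:
            accepted.append(record)
    return accepted, rejected


sample = json.dumps([
    {"id": "INT-1", "cwe": "CWE-120", "package": "examplelib",
     "description": "unchecked buffer copy"},
    {"id": "INT-2", "cwe": "120", "package": "examplelib",
     "description": "malformed CWE id"},
])
accepted, rejected = ingest(sample)
```

Here the first record is accepted and the second is rejected with a message explaining the malformed identifier, which gives the user feedback on custom entries before they reach the catalog.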

It is probable that the work under this effort will be classified under Phase II (see Description for details).


Integrate the service into an existing Navy CI/CD DevSecOps process:

  • Provide methods to rapidly ingest security and software package information.
  • Implement data procurement and on-boarding processes.
  • Develop the product/service to a maturity level that allows it to enter the third-party market as a dependent software package management and security vulnerability identification tool in both the commercial and government sectors.

    • Any commercial organization, private or public (e.g., Transportation, Medical Device Development, and/or the FDA), that does software verification and validation should be able to leverage the service.

KEYWORDS: DevSecOps; Continuous Integration; Continuous Deployment; Software; Vulnerabilities; Legacy Code; Software Scanning; Vulnerability Databases; Development, Security and Operations


