
Data-Driven Authorship Feature Extraction and Comparative Analysis using Machine Learning


TECHNOLOGY AREA(S): Electronics 

OBJECTIVE: Extract and compare unique human authorship identifiers from a broad array of digital data sets. A software system will be developed implementing artificial general intelligence to perform an automated analysis that associates these unique identifiers with single individuals, small groups, organizations, or virtual personas. The digital data sets can be sourced from written text (e.g., social/dark web media, emails, SMS text, manuscripts, articles, music compositions, software programs, handwritten letters/notes) and artwork (e.g., pictures, graffiti, and tattoos).

DESCRIPTION: Develop the analytical and pattern recognition capability to automatically detect and decipher unique identifying signatures within the style of the written script in order to identify characterization attributes such as the author's first language, education level, personality type, self-esteem, mental state, and gender. Analysis should also reveal certain biographical traits such as nationality, place of origin, and current location; political, religious, or extremist orientation; and intent. Categorize the author or group by training the machine learning model to recognize different languages and, via adaptive learning, to anticipate future changes in written style (due to maturity, potential mental and emotional state, and physical handicap). Machine learning will aid in identifying key attributes of authorship on targeted networks or social media sites and in searching for particular identifiers to discern particular authors of interest. This capability should also be able to persistently monitor targeted networks and search for additional attributable artifacts that can be associated with the author of interest. This capability will derive relevant information from various types of multi-domain information to identify, locate, and associate persons or organizations of interest in order to inform intelligence and support cyberspace operations.
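The feature extraction and comparative analysis described above can be illustrated with a minimal sketch. It assumes character n-gram frequency profiles as the stylistic features and cosine similarity as the comparison measure. Both are common stylometric choices, but neither is prescribed by this topic. The function names (char_ngrams, cosine_similarity, rank_candidates) and the sample data format are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed stylistic feature)."""
    # Lowercase and collapse whitespace so layout differences do not dominate.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile has no measurable style
    return dot / (norm_a * norm_b)


def rank_candidates(unknown_text, known_samples, n=3):
    """Rank candidate authors by similarity to an unattributed text.

    known_samples is a hypothetical dict mapping an author label to one
    writing sample by that author.
    """
    unknown = char_ngrams(unknown_text, n)
    scores = [
        (author, cosine_similarity(unknown, char_ngrams(sample, n)))
        for author, sample in known_samples.items()
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A profile comparison of this kind only ranks candidates by surface similarity. A system meeting the requirements above would need far richer features, much more training data per author, and a calibrated way to report uncertainty.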

PHASE I: Research and draft a white paper that lists the types of development approaches, algorithms, software, risks, schedule, and costs needed to automatically decipher a particular identifier across multiple data sets. Using machine and adaptive learning techniques, identify authorship via typed and written language in documents, artwork, and social media. The research shall be able to determine traits of a person or group, including native origin, intent, and behavior.

PHASE II: Develop and demonstrate capabilities/functions via a software prototype on a non-government PC or laptop connected to a non-attributed IP address, tested against a controlled data set on a non-government network.

PHASE III: The technology shall be transitioned to a PEO IEW&S Program of Record by ensuring that the software or algorithms meet DoD Information Assurance practices. The Contractor shall assist PEO IEW&S in testing, troubleshooting, and assessing the integration of the developed capability into a designated Program of Record.



KEYWORDS: Author(ship), Artificial Intelligence, Belief, Comparative Analysis, Composition, Computer Vision, Data-set Small-scale, De-noising Auto-encoder, Deterministic, Discriminative Training, Document Processing, Feature Extraction, Identification, Identity, Language, Nationality, Machine Learning, Media, Neural Nets, Pattern Recognition, Support Vector Machine (SVM), Written Scripts 


Michael Semenoro 

(443) 861-0690 

Ray McGowan 

(443) 861-0687 
