Information Extraction for New Emerging Noisy User-generated Micro-Text



OBJECTIVE: Leverage the latest research developments in Named Entity Recognition (NER) and related information extraction tasks to better cope with challenging but important noisy user-generated text, such as "chat" data, using state-of-the-art deep learning techniques.

DESCRIPTION: Noisy user-generated text, such as that found in social media, web forums, online reviews, Twitch chats, etc., is an increasingly important source of information because it tends to reflect real intentions, raw sentiments, unfiltered opinions, secret plans, etc. However, noisy user-generated text presents great challenges to information extraction tasks, including Named Entity Recognition (NER), because of its colloquial style of language, improper grammatical structures, spelling inconsistencies, informal abbreviations, slang, emoticons, etc. In addition, such data makes detecting named entities, which forms the basis of information extraction pipelines, difficult because noisy user-generated text often contains rare, unusual, previously unseen, and rapidly changing emerging entities with unusual surface forms. Recent university and commercial studies have shown that common approaches to information extraction on such data perform very poorly and do not generalize well enough to handle rare and emerging entity types [Augenstein et al., 2017] [Pottenger et al., 2015]. This demonstrates that information extraction, including NER, remains an unsolved task on such data and requires continued research into approaches with better generalization capabilities. To this end, several notable approaches have recently been proposed, all based on state-of-the-art deep learning techniques. One is the bidirectional LSTM approach proposed by a University of Cambridge research team [Limsopatham et al., 2016], which won first place in the Workshop on Noisy User-generated Text (WNUT) 2016 challenge. Another is a multi-task deep neural network approach based on LSTMs, proposed by a team from the University of Houston [Aguilar et al., 2017], which achieved the best performance in WNUT 2017.
More recently, breakthroughs in NER have come from the incorporation of neural language models, as evidenced by [Tran et al., 2017] [Peters et al., 2017] [Liu et al., 2017]. Deep learning coupled with neural language models exceeds previous performance records by a significant margin and shows promising results. These approaches, however, have not yet been extended to noisy user-generated data and do not account for the specific properties of such noisy data.
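To make the character-level modeling idea concrete, the toy sketch below trains a Laplace-smoothed character bigram language model in pure Python. This is only an illustration of why character-level statistics are sensitive to the spelling irregularities described above; it is not the architecture of any cited system, which use neural (LSTM-based) character models. All function names here are illustrative inventions.

```python
import math
from collections import Counter

def train_char_bigram_lm(tokens, alpha=1.0):
    """Train a Laplace-smoothed character bigram model from a list of tokens.
    Returns a function prob(a, b) = P(next char = b | current char = a)."""
    bigrams, context, vocab = Counter(), Counter(), set()
    for token in tokens:
        chars = ["^"] + list(token) + ["$"]  # word-boundary markers
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            context[a] += 1
    v = len(vocab)
    def prob(a, b):
        # Add-alpha smoothing so unseen character pairs get nonzero probability
        return (bigrams[(a, b)] + alpha) / (context[a] + alpha * v)
    return prob

def token_logprob(prob, token):
    """Log-probability of a whole token under the character model."""
    chars = ["^"] + list(token) + ["$"]
    return sum(math.log(prob(a, b)) for a, b in zip(chars, chars[1:]))
```

Under such a model, conventional spellings score higher than unusual surface forms, which is the signal a character-aware NER system can exploit when word-level features fail on noisy text.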

PHASE I: Perform an in-depth study of state-of-the-art deep learning techniques and determine which approach, or combination of approaches, best captures knowledge in the micro-text information space. Obtain baseline measurements that can be used in Phase II for development of applications based on the feasibility demonstration developed in Phase I. Techniques to be investigated include, but are not limited to, character-level language models (LMs) incorporated into deep neural networks to discover hidden information in the character-level irregularities common in noisy user-generated text; LSTM approaches; and NER applications.
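A natural baseline measurement for the NER work above is entity-level exact-match precision, recall, and F1, the style of metric used in the WNUT shared tasks. A minimal sketch, assuming entities are represented as `(start, end, type)` tuples (that representation is an assumption for illustration, not a requirement of the solicitation):

```python
def entity_f1(gold, pred):
    """Exact-match entity-level precision/recall/F1.
    gold and pred are collections of (start, end, type) tuples;
    an entity counts as correct only if span and type both match."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact span+type matches
    p = tp / len(pred) if pred else 0.0        # precision
    r = tp / len(gold) if gold else 0.0        # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Exact-match scoring is deliberately strict: a system that finds the right entity but clips its span gets no credit, which is why baselines on noisy micro-text tend to look so poor.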

PHASE II: Leverage the latest research developments in NER and related information extraction tasks, together with the results from Phase I, to better cope with challenging but important noisy user-generated micro-text, such as "chat" data, using state-of-the-art deep learning techniques. Develop an Open Architecture application approach that will provide improved NER and related information extraction performance for noisy social media "chat" text, and at the same time will reduce the human effort needed to create labeled ground truth. This is feasible because character language models not only encode complexities of language such as grammatical (and lexical) structure, but also distill information available in vast unannotated corpora [Jozefowicz et al., 2016].

PHASE III: Develop an application approach that can be employed to satisfy the requirements of multiple open architecture implementations such as OA DCGS, ICITE, or Apple and Android interfaces.


REFERENCES:
1. Aguilar, G., et al., "A Multi-task Approach for Named Entity Recognition in Social Media Data," WNUT 2017
2. Derczynski, L., et al., "Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition," WNUT 2017
3. Tran, Q., et al., "Named Entity Recognition with Stack Residual LSTM and Trainable Bias," arXiv:1706.07598, 2017
4. Liu, L., et al., "Empower Sequence Labeling with Task-Aware Neural Language Model," arXiv:1709.04109, 2017
5. Jozefowicz, R., et al., "Exploring the Limits of Language Modeling," arXiv:1602.02410, 2016
6. Limsopatham, N., et al., "Bidirectional LSTM for Named Entity Recognition in Twitter Messages," WNUT 2016
7. Pottenger, W. M., et al., SURREAL Final Report, A

KEYWORDS: Information Extraction, Micro-text, Named Entity Recognition, NER, Sentiment Analysis, Noisy Text, User Generated, Long Short-Term Memory (LSTM), Deep Learning, Active Learning