Optical Character Recognition (OCR) Automated Document Pre-processing Software

Description:

TECHNOLOGY AREA(S): Sensors, Electronics 

OBJECTIVE: Develop Optical Character Recognition (OCR) automated document pre-processing software that can be integrated into the US Army Machine Foreign Language Translation System (MFLTS) Software Architecture. Pre-processing software should provide automated document cleaning and correction for seamless OCR processing and machine translation (MT). 

DESCRIPTION: Optical Character Recognition (OCR) is an identified key system attribute (KSA) for the Army Machine Foreign Language Translation System (MFLTS) Program. As stated in the MFLTS Requirements Definition Package (APR 2015): "The first step in text language translation of hard copy documents is to have an accurate rendition of the original text for translation. Degraded or noisy documents of the type encountered in the operational environment where MFLTS will be used make character recognition difficult for OCR software. Degraded or noisy documents slow down the OCR process. Therefore, MFLTS must remove noise and improve the appearance of documents for high OCR accuracy, which will speed up the document translation process and provide more precise translations." OCR supports the following primary Army Phase 3 task: exploitation of hard copy documents collected at checkpoints, entry control points, base security operations, detainee and internment operations, and site exploitation.
PD MFLTS partnered with CERDEC-I2WD to commission a performance study of commercial and government off-the-shelf (GOTS) OCR products for Arabic script. The study, performed by Progeny Systems, determined that none of the available products met the accuracy level required by PD MFLTS. This was particularly true for operationally encountered documents that were not considered "clean," even though all of the products incorporate limited image pre-processing. We anticipate that an advanced image pre-processing tool can "clean up" such documents sufficiently for the OCR products to transcribe the images more accurately. Such a tool will also apply to scripts other than Arabic, which will benefit PD MFLTS as it adds languages to its portfolio. Therefore, OCR automated document pre-processing must remove noise and improve the appearance of documents for high OCR and MT accuracy.
OCR automated document pre-processing software must remove flaws such as speckle, watermarks, paper creases, stains, small holes, rough edges, lines on the paper, and copier noise and streaks. It must not change or degrade document formatting (e.g., font sizes and font formatting elements such as underline, italic, and bold).
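As an illustration of one of the flaws listed above, the following minimal sketch removes isolated speckle with a 3x3 median filter. It is an assumption-laden example, not a method this topic prescribes: the function name median_filter_3x3 and the list-of-rows grayscale image representation are hypothetical, and a median filter is only one of many possible de-speckling techniques.

```python
from statistics import median


def median_filter_3x3(image):
    """Return a filtered copy of `image`, a list of rows of 0-255 grayscale values.

    Each pixel is replaced by the median of its 3x3 neighbourhood; pixels on
    the border use only the neighbours that exist.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped to the image bounds.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


# Hypothetical test page: white background (255) with one dark speckle (0).
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter_3x3(page)
```

A filter this simple can also erase small legitimate marks such as dots and diacritics, which matters for Arabic script, so it would conflict with the requirement not to degrade the document unless combined with more careful analysis.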

PHASE I: Develop prototype software for two initial writing systems: English and Arabic. Demonstrate the prototype software on a variety of degraded writing samples. Note: If document noise removal is tied to specific languages, proposers must clearly identify these ties. All Phase I awards are required to identify a path forward for achieving compatibility / interoperability with the MFLTS software architecture. Phase I development will provide prototype software that could be used to externally pre-process document images sent to MFLTS to yield improved OCR accuracy.
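One way a Phase I demonstration might quantify "improved OCR accuracy" is the character error rate (CER): the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The sketch below is only an assumption about how such a comparison could be set up; this topic does not specify a metric, and the function names and sample strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical OCR outputs for the same image before and after pre-processing.
truth = "checkpoint"
ocr_without_cleaning = "chckp0int"
ocr_with_cleaning = "checkpoint"

cer_before = character_error_rate(truth, ocr_without_cleaning)
cer_after = character_error_rate(truth, ocr_with_cleaning)
```

Because Python strings are sequences of Unicode code points, the same functions work unchanged on Arabic text, although a real evaluation would first need to decide how to normalise diacritics and ligatures.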

PHASE II: Phase II development would result in a component integrated into the MFLTS architecture that pre-processes, automatically or on demand, any document images ingested into MFLTS, improving the effectiveness of analysts using MFLTS to exploit captured foreign language documents.

PHASE III: The software becomes a fully licensed, supported, fielded component of the MFLTS program, with the potential to expand language sets beyond the initial set. Phase III applications would include all commercial or military settings where there is a need to apply OCR to documents in less than pristine condition (e.g., scans of documents damaged by fire or flood, historical documents, and scrolls).

REFERENCES: 

1: Parker, Jon, Ophir Frieder, and Gideon Frieder. "Automatic Enhancement and Binarization of Degraded Document Images." In Document Analysis and Recognition (ICDAR), 2013 International Conference on, IEEE, (2013).

2: Parker, Jon, Ophir Frieder, and Gideon Frieder. "Robust Binarization of Degraded Document Images Using Heuristics." In Proceedings of Document Recognition and Retrieval XXI, (2014).

3: Alginahi, Yasser. "Preprocessing Techniques in Character Recognition." In Character Recognition, Minoru Mori (Ed.), InTech, ISBN: 978-953-307-105-3, (2010). Available from: http://www.intechopen.com/books/characterrecognition/preprocessing-techniques-in-character-recognition

 

KEYWORDS: Optical Character Recognition, Automated Pre-processing, Image Recognition 
