
Setac: Enhancing Usability of Archived Weather Data in the Digital Age

Award Information
Agency: Department of Commerce
Branch: National Oceanic and Atmospheric Administration
Contract: NA21OAR0210493
Agency Tracking Number: 2943387
Amount: $149,882.00
Phase: Phase I
Program: SBIR
Solicitation Topic Code: 9.5
Solicitation Number: NOAA-OAR-OAR-TPO-2021-2006702
Solicitation Year: 2021
Award Year: 2021
Award Start Date (Proposal Award Date): 2021-09-01
Award End Date (Contract End Date): 2022-02-28
Small Business Information
3721-D University Dr.
Durham, NC 27707
United States
DUNS: 059333349
HUBZone Owned: No
Woman Owned: No
Socially and Economically Disadvantaged: No
Principal Investigator
Name: Jacob Vosburgh
Title: Computer Vision Scientist
Phone: (919) 433-2400
Business Contact
Name: Margaret Reeves
Title: CFO
Phone: (919) 433-2400
Research Institution

NOAA has used historical documents such as ship logs and many other resources to collect weather data critical to modeling global and regional climate and weather conditions. To date, the optical character recognition (OCR) technology developed over the past three decades remains limited in its ability to recognize handwriting and reliably extract text in context. Machine learning (ML) algorithms can help improve these processes.
Given the importance of accuracy for weather data, we propose the development and testing of a custom OCR/text extraction application built using OpenCV and Tesseract. Both are open-source and operate within the open-source Python environment. PyTorch will be evaluated as the deep learning library to optimize the OpenCV and Tesseract integration and post-processing.
We submit that, compared to previous efforts, this integration will provide more flexible pre-processing without undue complexity or required expertise, require less post-processing, and establish a framework for adding automation to pre-processing, post-processing, and tuning.
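To make the proposed integration concrete, the following is a minimal sketch of how an OpenCV pre-processing step could feed Tesseract, followed by a light clean-up step. It is an illustration under stated assumptions, not the awardee's implementation: the function names (`preprocess`, `extract_text`, `clean_lines`), the choice of median blur plus Otsu thresholding, and the `--psm 6` page segmentation mode are all hypothetical choices, and the `pytesseract` wrapper is assumed as the Python binding for Tesseract. The PyTorch stage is not shown.

```python
from typing import List


def preprocess(image_path: str):
    """Load a scanned page and binarize it (assumed pre-processing steps)."""
    # Imported lazily so clean_lines() works without the OCR stack installed.
    import cv2

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(f"could not read image: {image_path}")
    # A small median blur to suppress speckle noise before thresholding.
    blurred = cv2.medianBlur(gray, 3)
    # Otsu's method picks the binarization threshold automatically.
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return binary


def extract_text(binary_image) -> str:
    """Run Tesseract on a pre-processed image via the pytesseract wrapper."""
    import pytesseract

    # "--psm 6" treats the page as one uniform block of text (an assumption;
    # tabular log pages may need a different segmentation mode).
    return pytesseract.image_to_string(binary_image, config="--psm 6")


def clean_lines(raw_text: str) -> List[str]:
    """Trim whitespace and drop empty lines from raw OCR output."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def run_pipeline(image_path: str) -> List[str]:
    """Chain the three steps: pre-process, extract, clean."""
    return clean_lines(extract_text(preprocess(image_path)))
```

Keeping each stage as a separate function is one way to leave room for the automation and tuning the proposal mentions, since any single stage can be swapped or parameterized without touching the others.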
Our objectives are to:
1. Demonstrate feasibility of OpenCV as an image pre-processing tool for document layout analysis
2. Demonstrate feasibility of Tesseract as a text extraction tool
3. Demonstrate feasibility of using PyTorch as an adaptive deep learning library for post-processing and information extraction
4. Validate performance
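
The abstract does not say how performance would be validated. One common measure for OCR output is the character error rate (CER): the edit distance between the extracted text and a hand-checked reference transcription, divided by the length of the reference. The sketch below assumes that metric; the function names are hypothetical and only the Python standard library is used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,          # deletion
                    current[j - 1] + 1,       # insertion
                    previous[j - 1] + substitution_cost,
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Example: one misread digit in a five-character pressure reading.
print(character_error_rate("29.92", "29.97"))  # 0.2
```

For weather records, a single wrong digit can change a reading materially, so a per-field exact-match rate may be a useful complement to CER.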

* Information listed above is at the time of submission. *
