Automated Data Cleansing in Data Harvesting and Data Migration

Award Information

Department of Energy
Award ID:
Program Year/Program:
2007 / SBIR
Agency Tracking Number:
Solicitation Year:
Solicitation Topic Code:
Solicitation Number:
Small Business Information
Information International Associates, Inc.
1055 Commerce Park Drive Suite 110 Oak Ridge, TN 37830
Woman-Owned: No
Minority-Owned: No
HUBZone-Owned: No
Phase 1
Fiscal Year: 2007
Title: Automated Data Cleansing in Data Harvesting and Data Migration
Agency: DOE
Contract: DE-FG02-07ER84709
Award Amount: $100,000.00


In the explosion of digitized information that is available through corporate databases, data stores, and online search systems, a persistent problem is the management of the sheer volume of information identified. This information comes in the form of unstructured, semi-structured, and structured data. One of the key issues that exacerbate this information overload is the production of duplicate or near-duplicate information. In addition, the near-duplicate items frequently have different origins, creating a situation in which each item may have unique information of value, but the differences are not significant enough to justify maintaining the items as separate entities.

This project will develop a toolset to identify and remove duplicate and near-duplicate items in the context of a system that allows unique information in a set of near-duplicate items to be consolidated into a single comprehensive item. The approach involves coupling the Secure Hash Algorithm (SHA) with Latent Semantic Indexing and other technologies applied to a representative set of information factors. The ability to process a representative set of factors will be demonstrated in Phase I. Phase II will focus on enlarging the domain of characteristics that can be incorporated to increase system effectiveness.

Commercial Applications and other Benefits as described by the awardee: The DOE has created electronic versions of all reports, including paper reports created in the early days of research. Over the years, a number of duplicate or near-duplicate entries have been introduced. The new toolset should be able to identify and remove these duplicate and near-duplicate documents. Other government agencies and commercial organizations also should benefit greatly from this technology. Private sector applications include Lexis-Nexis and Dialog, and the toolset could be extended to services such as Factiva and other specialized web databases.
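The general idea in the abstract (a hash to catch exact duplicates, a similarity measure to catch near-duplicates, then consolidation of the unique fields) can be illustrated with a minimal sketch. This is not the awardee's toolset: the record layout, the function names, and the 0.8 threshold are assumptions made for illustration, SHA-256 is chosen here as one member of the SHA family, and a simple token-overlap (Jaccard) score stands in for Latent Semantic Indexing, which would require a term-document matrix and a singular value decomposition.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not change the fingerprint."""
    return re.sub(r"\s+", " ", text.strip().lower())


def sha256_fingerprint(text):
    """Exact-duplicate key: SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def jaccard(a, b):
    """Token-set overlap in [0, 1]. A stand-in for LSI similarity."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consolidate(records, threshold=0.8):
    """Merge duplicate and near-duplicate records.

    Each record is a dict with a 'text' field plus optional metadata
    (a hypothetical layout). The first record seen is kept as the base
    item; later duplicates only fill in fields the base item lacks.
    """
    merged = []    # consolidated items, in first-seen order
    by_hash = {}   # fingerprint -> index into merged
    for rec in records:
        fp = sha256_fingerprint(rec["text"])
        idx = by_hash.get(fp)            # fast path: exact duplicate
        if idx is None:                  # slow path: near-duplicate scan
            for i, item in enumerate(merged):
                if jaccard(item["text"], rec["text"]) >= threshold:
                    idx = i
                    break
        if idx is None:
            merged.append(dict(rec))
            by_hash[fp] = len(merged) - 1
            continue
        by_hash[fp] = idx
        for key, value in rec.items():
            if value and not merged[idx].get(key):
                merged[idx][key] = value
    return merged


if __name__ == "__main__":
    # Hypothetical report entries, invented for this example.
    records = [
        {"text": "Neutron flux measurements in the graphite reactor",
         "year": "1952", "report_no": ""},
        {"text": "Neutron  flux measurements in the Graphite Reactor",
         "year": "", "report_no": "R-001"},
        {"text": "Neutron flux measurements in the graphite reactor core",
         "pages": "12"},
        {"text": "Thermal conductivity of uranium dioxide"},
    ]
    for item in consolidate(records):
        print(item)
```

In this sketch the hash lookup is a constant-time check, while the near-duplicate scan compares against every consolidated item, so a larger collection would need an index or blocking step to stay practical.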

Principal Investigator:

Edrick G. Coppock

Business Contact:

Franciel Azpurua
Small Business Information at Submission:

Information International Associates, Inc.
1155 Commerce Park Drive Oak Ridge, TN 37831

EIN/Tax ID: 621500232
Number of Employees:
Woman-Owned: No
Minority-Owned: No
HUBZone-Owned: No