Automated Data Cleansing in Data Harvesting and Data Migration
Small Business Information
1155 Commerce Park Drive, Oak Ridge, TN, 37831
Abstract
In the explosion of digitized information available through corporate databases, data stores, and online search systems, a persistent problem is managing the sheer volume of information identified. This information comes in the form of unstructured, semi-structured, and structured data. One of the key issues exacerbating this information overload is the production of duplicate or near-duplicate information. In addition, near-duplicate items frequently have different origins, so each item may contain unique information of value, yet the differences are not significant enough to justify maintaining the items as separate entities.
This project will develop a toolset to identify and remove duplicate and near-duplicate items, within a system that allows the unique information in a set of near-duplicate items to be consolidated into a single comprehensive item. The approach couples the Secure Hash Algorithm (SHA) with Latent Semantic Indexing and other technologies, applied to a representative set of information factors. The ability to process a representative set of factors will be demonstrated in Phase I. Phase II will focus on enlarging the domain of characteristics that can be incorporated to increase system effectiveness.
Commercial Applications and other Benefits as described by the awardee: The DOE has created electronic versions of all reports, including paper reports created in the early days of research. Over the years, a number of duplicate or near-duplicate entries have been introduced. The new toolset should be able to identify and remove these duplicate and near-duplicate documents. Other government agencies and commercial organizations should also benefit greatly from this technology. Private sector applications include Lexis-Nexis and Dialog, and the toolset could be extended to services such as Factiva and other specialized web databases.
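The two-stage idea in the abstract, a cryptographic hash to catch exact duplicates followed by a similarity measure to catch near-duplicates, can be illustrated with a minimal sketch. This is not the awardee's toolset. SHA-256 is assumed as the hash variant, and word-shingle Jaccard similarity is swapped in as a simple stand-in for Latent Semantic Indexing. The function names, the normalization rule, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and keep only alphanumeric tokens (an assumed normalization rule)."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def sha_fingerprint(text):
    """SHA-256 digest of the normalized text; equal digests mark exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of k-word windows; a crude stand-in for an LSI document vector."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(docs, threshold=0.5):
    """Return (exact, near) lists of (kept_id, other_id) pairs.

    docs maps a document id to its text. Stage 1 groups by hash;
    stage 2 compares only the documents that survived stage 1.
    """
    first_seen = {}  # fingerprint -> id of the first document with that hash
    unique_ids = []
    exact = []
    for doc_id, text in docs.items():
        fingerprint = sha_fingerprint(text)
        if fingerprint in first_seen:
            exact.append((first_seen[fingerprint], doc_id))
        else:
            first_seen[fingerprint] = doc_id
            unique_ids.append(doc_id)

    sets = {doc_id: shingles(docs[doc_id]) for doc_id in unique_ids}
    near = []
    for i, a in enumerate(unique_ids):
        for b in unique_ids[i + 1:]:
            if jaccard(sets[a], sets[b]) >= threshold:
                near.append((a, b))
    return exact, near
```

Calling `find_duplicates` on a dictionary of report texts returns the exact-duplicate pairs, which are safe to drop, separately from the near-duplicate pairs, which are candidates for the consolidation step the abstract describes. The pairwise comparison in stage 2 is quadratic in the number of documents, so a real collection would need an indexing scheme that this sketch does not attempt.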
* Information listed above is as of the time of submission.