Resolving Biological Entity References (Text/Databases)

Award Information
Department of Health and Human Services
Award Year:
Phase II
Award Id:
Agency Tracking Number:
Solicitation Year:
Solicitation Topic Code:
Solicitation Number:
Small Business Information
181 N. 11TH ST, SUITE 401, NEW YORK, NY, 11211
Hubzone Owned:
Minority Owned:
Woman Owned:
Principal Investigator:
(917) 292-8845
Business Contact:
Research Institute:
DESCRIPTION (provided by applicant): The Big Idea driving this project is that human memory is small, the body of scientific knowledge is vast, and breakthroughs are possible if software can do a better job of connecting researchers with knowledge in text and databases. The first step is to stop looking for words in data (as a search engine does) and instead try to find facts in data. A fact is something claimed explicitly in text or a database. A fact may not be true, but we want to develop software that finds those facts. What does a fact look like? Consider the following sentence from MEDLINE: "Recently, we have found that Htt is an antiapoptotic protein in striatal cells and acts by preventing caspase-3 activity." It contains the fact that gene id 6532 (in the Entrez Gene database) regulates gene id 836. Software can extract such facts from a sentence like this, but the current state of the art is not doing a great job of it. The reason is simple: current systems are focused on not making mistakes, which means that they miss a lot of opportunities to find facts. The best reported performance is around 40% of the facts being found, which we think severely compromises the usefulness of text mining technologies in bioinformatics. This is where we are trying a different approach: we are focused on finding all the facts. We call this total recall, which we demonstrated was possible in Phase I, but total recall comes with a price: we make lots of mistakes. The key innovation is that we keep score of how confident we are of any given fact, which gives us an important point of leverage in sifting good facts from bad. Our Phase II proposal focuses on developing techniques to reason over such fact-heavy analysis by exploring soft clustering approaches, structured classification and effective user interface design.
We have partnered with Harvard, Columbia and Pfizer to keep our research effort focused on problems that actually matter for genomics experiments and early-phase drug discovery. In addition, we fit into the NIH's data sharing policy by making our software free (with source code) to organizations that make their data free too. We, as do many others, believe that many great scientific discoveries lie implicit and just below the surface of the research literature. All that is required is for the right researcher to see the right sentence or database entry to form a novel hypothesis and cure a disease. Total recall approaches to fact extraction make that outcome all the more likely. The dominant paradigm in text mining is to treat the text like a database, but researchers would be better served by a more search-like approach to extracting and correlating facts in text and databases. We are committed to making all the facts available to scientists, that is, total recall, which is not currently available.
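
The confidence-scoring idea in the description can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `Fact` record, the `filter_facts` and `rank_facts` helpers, the relation labels and all confidence values are hypothetical, chosen only to show how keeping every candidate fact with a score lets a reader trade recall against precision after extraction. The gene ids are the two quoted in the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fact:
    """One extracted claim: subject gene, relation, object gene, and a score."""
    subject_gene_id: int
    relation: str
    object_gene_id: int
    confidence: float  # hypothetical score in [0, 1]; higher means more trusted


def filter_facts(facts: List[Fact], min_confidence: float) -> List[Fact]:
    """Keep facts scoring at or above the threshold (0.0 keeps everything)."""
    return [f for f in facts if f.confidence >= min_confidence]


def rank_facts(facts: List[Fact]) -> List[Fact]:
    """Order facts from most to least confident, as a search-like result list."""
    return sorted(facts, key=lambda f: f.confidence, reverse=True)


# Candidate facts for the example sentence. Only the gene ids come from the
# description; the scores and the alternative candidates are made up.
candidates = [
    Fact(6532, "regulates", 836, 0.90),  # the fact named in the description
    Fact(836, "regulates", 6532, 0.20),  # assumed spurious reversed reading
    Fact(6532, "binds", 836, 0.55),      # assumed alternative relation
]

everything = filter_facts(candidates, 0.0)   # total recall: nothing discarded
confident = filter_facts(candidates, 0.5)    # stricter view: fewer mistakes

for fact in rank_facts(everything):
    print(fact)
```

A threshold of 0.0 corresponds to the total recall setting, where every candidate is retained and the score alone separates good facts from bad; raising the threshold recovers the more conservative behavior the description attributes to current systems.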

* Information listed above is as of the time of submission.
