Data Mining Software for Large-Scale Analyses of Infections Caused by Hepatitis Viruses



The incidence of hepatitis A (HAV) is falling rapidly, but food-borne HAV outbreaks occur occasionally and can involve large numbers of people. An estimated 50,000 Americans acquire new hepatitis B (HBV) infection annually. The health burden of persistent or chronic HBV infection is also heavy, with high mortality due to cirrhosis or cancer. In addition to the estimated 1.2 million people in the USA who are chronically infected with HBV, immigration adds an estimated 40,000 foreign-born, chronically infected persons per year. Chronic HBV causes liver cancer, and this cancer is a leading cause of death for certain population groups in the USA, such as Asian-Americans. Hepatitis C (HCV) is the most common chronic blood-borne infection in the United States, affecting 3.2 million Americans. Since 2000, an estimated 85,000 new HCV infections have occurred every year, and approximately 19,000 new infections occurred in 2006. However, 50% of HCV-infected persons are unaware of their infection. Additionally, several studies have estimated that 30% of HIV-infected individuals are also infected with HCV and that 60-90% of individuals infected with HIV through intravenous drug use (IVDU) have HCV. The total estimate for co-infected individuals in the USA is 300,000. About a fifth of the American population is estimated to be exposed to hepatitis E (HEV), but the health impact of this exposure is unknown.

Hepatitis viruses are a diverse group of viruses with different major modes of transmission. Hepatitis A and hepatitis E viruses (HAV and HEV) may cause food-borne outbreaks. Hepatitis B and hepatitis C viruses (HBV and HCV) are blood-borne viruses. Although infection with any of the hepatitis viruses has a similar clinical presentation, the degree of sequence heterogeneity of these viruses varies. Recent advances in laboratory technologies and computational biology have facilitated comprehensive sequence analysis of the genomes of hepatitis viruses. This information allows for further refinement of molecular epidemiological approaches and provides opportunities to link molecular epidemiological data to demographic, clinical, laboratory and epidemiological data. In the course of clinical and surveillance studies and outbreak investigations, CDC generates, collects and analyzes such data. Because of the diversity in the types and sources of these data, CDC seeks a software application that can integrate these disparate datasets and permit data mining and investigator-initiated analysis.

Project Goal:  The goal of this project is to develop data mining software that extracts, transforms, and loads structured data relating to infection with hepatitis viruses from diverse sources into a warehouse appropriate for mainframe, client/server, and PC platforms. These data will include, but may not be limited to, demographic, clinical, epidemiological, laboratory and phylogenetic information. The software will store and manage the data in a relational database system with a web-based interface that gives the scientific community access to the data and supports analysis of relationships in the stored data, using end-user defined queries to discover disease patterns and trends. The software will have hook interfaces that allow data to be exported to external programs for additional analysis and allow the results of that analysis to be captured back into the database. The software will also export the outcomes of such analyses in publication-ready formats.
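
As an illustration only, the extract, transform, load, query and export steps described above could be sketched as follows. This is a minimal sketch under stated assumptions, not a proposed design: it uses Python with an in-memory SQLite database as a stand-in for the relational database system, and every table name, column name, record and function name in it (demographic, lab, extract, load, run_query, export_csv) is hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extracts; column names and values are illustrative only.
DEMOGRAPHIC_CSV = """case_id,age,state
C1,34,GA
C2,51,NY
C3,27,GA
"""
LAB_CSV = """case_id,virus,genotype
C1,HCV,1a
C2,HBV,A
C3,HCV,1a
"""


def extract(text):
    """Extract: parse one CSV source into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def load(conn, demographic_rows, lab_rows):
    """Transform rows into typed tuples and load them into two related tables."""
    conn.execute(
        "CREATE TABLE demographic (case_id TEXT PRIMARY KEY, age INTEGER, state TEXT)"
    )
    conn.execute(
        "CREATE TABLE lab (case_id TEXT REFERENCES demographic(case_id), "
        "virus TEXT, genotype TEXT)"
    )
    conn.executemany(
        "INSERT INTO demographic VALUES (?, ?, ?)",
        [(r["case_id"], int(r["age"]), r["state"]) for r in demographic_rows],
    )
    conn.executemany(
        "INSERT INTO lab VALUES (?, ?, ?)",
        [(r["case_id"], r["virus"], r["genotype"]) for r in lab_rows],
    )


def run_query(conn, sql, params=()):
    """Run an end-user defined, parameterized query; return (headers, rows)."""
    cur = conn.execute(sql, params)
    headers = [d[0] for d in cur.description]
    return headers, cur.fetchall()


def export_csv(headers, rows):
    """Export hook: serialize a result set as CSV text for an external program."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue()


conn = sqlite3.connect(":memory:")
load(conn, extract(DEMOGRAPHIC_CSV), extract(LAB_CSV))

# Example end-user query: HCV cases per state and genotype.
headers, rows = run_query(
    conn,
    "SELECT d.state, l.genotype, COUNT(*) AS cases "
    "FROM demographic d JOIN lab l ON l.case_id = d.case_id "
    "WHERE l.virus = ? GROUP BY d.state, l.genotype ORDER BY d.state",
    ("HCV",),
)
report = export_csv(headers, rows)
```

In this sketch the query joins epidemiological and laboratory records on a shared case identifier, which is one simple way to link the disparate datasets; results returned by an external analysis program could be written back through the same kind of load step.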

Impact:  It is expected that the software will generate associations between epidemiological and laboratory data, leading to the discovery of new disease patterns, epidemiological trends and proteomic associations. Such discoveries are expected to lead to new strategies for public health interventions, surveillance, prophylaxis and the development of antivirals and vaccines. This software tool will be applicable not only to hepatitis viruses but also to other pathogens in the areas of epidemiology, laboratory research and public health.

