Thursday, 25 March 2010

Putting a computer science spin on genetic diagnostics

Collections of genetic profiles have continued to grow steadily, but scientists have struggled a bit with finding the most effective way to use them. In a paper published in PNAS this week, a group of researchers took one of the larger gene expression data repositories and sought to parse its disease-related data with a few computational techniques. They were able to use the resulting database in conjunction with a diagnostic program to accurately diagnose a given gene expression profile up to 95 percent of the time.

Gene expression data can be used to identify what differences in expression are likely to be connected to the presence of a certain disease. The formal association of a gene with a disease is known as an "annotation." However, getting the expression data and annotations into a usable form has been a challenge, and previous approaches have been limited to straightforward queries, asking the database to match a given profile or a phenotype. This approach leaves a lot of information untapped.

Scientists realized they could improve the usability of genetic databases by sorting their expression profiles into disease classes, and then querying the database with similar profiles. This would turn the databases into a predictive diagnostic tool—it would take gene expression profiles as input, find other matching profiles, and then check the matches for their disease annotations.

First, researchers standardized gene expression profiles by sorting them into a hierarchical system of disease classifications. They compared each diseased gene array result to a normal expression profile, and took the logarithm of the ratio between them. These log ratios gave researchers profiles to work with that were standardized across a collection of platforms and labs. They also evaluated the similarities between standardized profiles to identify correlations between gene combinations and diseases. Finally, they standardized the disease annotations associated with genes using the Unified Medical Language System.
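
As a rough illustration of the standardization step, the sketch below compares a diseased sample to a normal one gene by gene and takes the base-2 logarithm of the ratio. The gene names, expression values, function name, and the pseudocount that guards against division by zero are all assumptions made for this example; they are not taken from the paper.

```python
import math


def log_ratio_profile(disease, normal, pseudocount=1.0):
    """Return per-gene log2(disease / normal) for genes present in both samples.

    `pseudocount` is an illustrative assumption that keeps the ratio finite
    when an expression level is zero.
    """
    profile = {}
    for gene, disease_level in disease.items():
        if gene not in normal:
            continue  # skip genes that were not measured in the normal sample
        ratio = (disease_level + pseudocount) / (normal[gene] + pseudocount)
        profile[gene] = math.log2(ratio)
    return profile


# Hypothetical expression levels, invented for illustration only.
disease_sample = {"GENE_A": 7.0, "GENE_B": 1.0, "GENE_C": 3.0}
normal_sample = {"GENE_A": 3.0, "GENE_B": 3.0, "GENE_C": 3.0}

profile = log_ratio_profile(disease_sample, normal_sample)
for gene, value in profile.items():
    print(gene, round(value, 3))
```

A positive value means the gene is more active in the diseased sample, a negative value means it is less active, and zero means no change. Because each profile is expressed relative to its own control, results from different labs and array platforms become easier to compare.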

Once their database of profiles was fully standardized, researchers created Bayesian classifiers for each disease grouping. Bayesian probability evaluates the likelihood of one event given that another has been observed. For example, if a blood sample tests positive for cancer, Bayes' theorem states that the probability of that person actually having cancer depends on the accuracy of the test, the independent probability of someone getting cancer, and the overall probability of testing positive for cancer.
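
The cancer test example can be worked through with Bayes' theorem directly. The sketch below is only an illustration of the arithmetic: the prevalence, sensitivity, and false positive rate are made-up numbers, and the function name is hypothetical.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: P(disease | positive test).

    prior               -- P(disease), the independent probability of the disease
    sensitivity         -- P(positive | disease), how often the test catches it
    false_positive_rate -- P(positive | no disease)
    """
    # Overall probability of testing positive, with or without the disease.
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# Illustrative numbers only: 1% prevalence, 90% sensitivity, 5% false positives.
result = posterior_given_positive(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"P(disease | positive) = {result:.3f}")
```

With these assumed inputs, a positive result raises the probability of disease from 1 percent to roughly 15 percent, which shows why the test's accuracy alone does not determine the answer.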

Classifiers like these allow the program to evaluate an expression profile based on disease prevalence in similar profiles. Aside from the number of variables they account for, Bayesian systems are also able to "learn" and take into account new information, which is ideal for a genetic database where new samples are being added all the time.

With the classifiers in place, the diagnostic database was ready to use. When it was fed a query profile to figure out what diseases the person behind the profile might be prone to, the database would assess the profile's similarity to others it had on record and pull up the relevant Bayesian disease classes. The program could then read out the annotated disease concepts that correlated with the query profile.
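
The query flow described above can be sketched in a much simplified form. The example below ranks stored profiles by Pearson correlation with the query and tallies the disease annotations of the closest matches. It deliberately swaps the paper's Bayesian classifiers for a plain nearest-neighbor vote, and the profiles, disease labels, and function names are all hypothetical.

```python
import math
from collections import Counter


def pearson(a, b):
    """Pearson correlation over the genes two profiles have in common."""
    genes = sorted(set(a) & set(b))
    if len(genes) < 2:
        return 0.0
    xs = [a[g] for g in genes]
    ys = [b[g] for g in genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (spread_x * spread_y)


def rank_annotations(query, database, top_k=2):
    """Tally disease annotations among the top_k most similar stored profiles."""
    ranked_records = sorted(
        database, key=lambda rec: pearson(query, rec["profile"]), reverse=True
    )
    votes = Counter()
    for record in ranked_records[:top_k]:
        votes.update(record["annotations"])
    return votes.most_common()


# Hypothetical standardized profiles and annotations, for illustration only.
database = [
    {"profile": {"G1": 1.0, "G2": -1.0, "G3": 0.5}, "annotations": ["disease_x"]},
    {"profile": {"G1": 0.8, "G2": -0.9, "G3": 0.4}, "annotations": ["disease_x", "disease_y"]},
    {"profile": {"G1": -1.0, "G2": 1.0, "G3": -0.5}, "annotations": ["disease_z"]},
]
query = {"G1": 0.9, "G2": -1.1, "G3": 0.6}

ranked = rank_annotations(query, database)
print(ranked)
```

Here the two stored profiles that move in the same direction as the query are selected, so their shared annotation rises to the top, while the anti-correlated profile and its annotation are left out.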

Overall, the system had a diagnostic accuracy rate of 95 percent, with a precision of 82 percent. Researchers found the accuracy of the results was significantly improved when they applied a second Bayesian step for error correction. They also found that more datasets produced much more accurate results—for example, a test for a rare disease that only had three datasets associated with it in the system had a diagnostic precision of only 41 percent.
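
Accuracy and precision measure different things, which is why the two figures can diverge. The sketch below computes both from a confusion matrix. The counts are invented and were chosen only so that the arithmetic lands on the reported 95 and 82 percent; they are not the paper's data.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all calls, positive or negative, that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Fraction of positive calls that were actually correct."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only (not taken from the study):
# true positives, false positives, true negatives, false negatives.
tp, fp, tn, fn = 82, 18, 868, 32

acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
print(f"accuracy = {acc:.2f}, precision = {prec:.2f}")
```

Because most profile and disease pairings are true negatives, accuracy can stay high even when a noticeable share of the positive calls are wrong, which is what precision captures.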

In addition to diagnosing diseases, the database was also fairly adept at finding relationships between diseases and drugs, provided that profiles contained information on the effects of medications. The system was able to recover many known drug side effects, and also suggested new disease-drug relationships. For example, the researchers were able to construct a disease-drug map that linked an anticancer drug, doxorubicin, to skin disorders (the drug has a side effect of skin inflammation) and to cardiovascular disease (it has a cumulative toxic effect on the heart over time).

Some have expressed apprehension about the use of genetic diagnoses, in part because their predictions are somewhat unreliable. This program could potentially overcome those concerns by making diagnoses more robust, and providing some quantification of the uncertainties. The authors note that the system's diagnostic accuracy and precision should continue to improve as more samples become available. Its creators also hope to integrate more phenotypes into the database, such as gene expression changes associated with stress responses and cell differentiation, possibly creating another map that could be overlaid on the genetic one to provide a different kind of predictive information.