Molecular Genetics in the age of information overload

Posted by Gabe Musso, on 18 December 2013

Those of us who are of a certain age can remember standing overwhelmed at the video store, agonizing over which movie to rent. Of course, today video stores in the US have been pushed to the brink of extinction by delivery and on-demand services like Netflix, which, ironically, have far more movies to offer. In 2006, Netflix offered one million dollars for the best answer to a simple question: how do we present thousands of movie choices in a way that isn’t overwhelming? Netflix knew that their library had something for everyone, and that success was just a matter of bringing the most appealing options to their customers’ attention. Today, biologists face a similar problem. An abundance of expression, interaction, and sequence variation data can make the prospect of selecting genes for experimentation daunting. With the prize in this case being a better understanding of disease, can we pre-select genes in a meaningful way?

The idea of prioritizing genes for experimental assay is certainly not a new one. However, recent applications of machine learning concepts in biology have allowed predictions of gene function or phenotype to be made on a global scale. Put simply, machine learning is the automatic identification and exploitation of patterns; in this case, patterns among genes that have a phenotype we’re interested in. For example, many learning approaches can be generically classified as ‘guilt-by-profiling’. In these cases, profiles of gene features (e.g. tissue expression, protein domains) are examined for patterns corresponding to genes of a particular function or phenotype. These patterns can then be used to predict additional genes as sharing the same function/phenotype. Revisiting the Netflix example, you see guilt-by-profiling in action every time a movie is recommended to you: features of movies you like are used to predict other movies you might like. Similarly, in ‘guilt-by-association’, relationships between gene pairs (e.g. co-expression, physical association, genetic interaction) are used to ‘transfer’ a function or phenotype from one gene to another.
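
To make guilt-by-association concrete, here is a minimal Python sketch. It is not the method used in the paper: the gene names, the edge weights, and the scoring rule (the weighted fraction of a gene's neighbours that already carry the phenotype) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def guilt_by_association(edges, known_positives):
    """Score each unannotated gene by the weighted fraction of its
    network neighbours that are already annotated with the phenotype.

    edges: iterable of (gene_a, gene_b, weight) tuples (undirected).
    known_positives: set of genes already known to have the phenotype.
    """
    neighbours = defaultdict(dict)
    for gene_a, gene_b, weight in edges:
        neighbours[gene_a][gene_b] = weight
        neighbours[gene_b][gene_a] = weight

    scores = {}
    for gene, linked in neighbours.items():
        if gene in known_positives:
            continue  # already annotated, nothing to predict
        total = sum(linked.values())
        positive = sum(w for other, w in linked.items() if other in known_positives)
        scores[gene] = positive / total if total else 0.0
    return scores


# Hypothetical toy network; these are placeholder names, not real genes.
edges = [
    ("geneA", "geneB", 1.0),
    ("geneA", "geneC", 1.0),
    ("geneB", "geneC", 1.0),
    ("geneC", "geneD", 1.0),
    ("geneD", "geneE", 1.0),
]
known = {"geneA", "geneB"}

scores = guilt_by_association(edges, known)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy network, geneC ranks first because two of its three neighbours are already annotated, while geneD and geneE have no annotated neighbours. Guilt-by-profiling would instead train a classifier on per-gene feature vectors rather than on links between genes.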

Learning approaches have been used to prioritize gene functions across virtually all model organisms, and to predict phenotypes in yeast, C. elegans, and various cell lines. However, phenotype prediction had yet to be systematically validated in vivo in any vertebrate.

In our work (Musso et al., online in Development this week) we tested a phenotype prioritization scheme in zebrafish. Zebrafish are fast growing and produce hundreds of progeny, allowing experimentation at a scale large enough to confirm our findings. Also, the transparency of the embryos has allowed observation of hundreds of developmental phenotypes, giving us a large space of potential phenotypes to predict. We mined an existing public database (www.zfin.org) that catalogues the effect of morpholinos (customizable oligonucleotides that can inhibit transcripts of interest during the first days of development) on hundreds of developmental anatomical processes. We then obtained features and relationships for zebrafish genes (tissue expression, expression from microarray experiments, protein domain information, orthology, and protein and genetic interactions), and predicted, for over 15,000 zebrafish genes, which would affect each of 338 developmental process terms upon knockdown.

The result of our learning procedure was over 5 million gene-phenotype prediction scores. Filtering these scores based on estimates of precision (fraction of predictions which are correct), we were left with thousands of predictions deemed ‘high-confidence’, spanning nearly 100 phenotypes. Even with the scalability of zebrafish, this was too much to evaluate systematically. We decided to focus on cardiovascular phenotypes, picking one anatomical process term broadly describing cardiovascular function. This term performed well, as did dozens of additional terms describing neuronal, sensory, or reproductive phenotypes. We used morpholinos to disrupt the 16 genes scoring above a 95% precision cutoff, screening the bottom-ranked genes as negative controls. Not knowing what phenotypes to expect, we used a broad semi-quantitative scoring system to evaluate cardiac function and morphology post-disruption.
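
The precision-based filtering step can be sketched as follows. This is an illustration only: the post does not say how precision was estimated, so the sketch assumes a set of held-out genes whose true phenotype status is known, and the function names and toy scores are hypothetical.

```python
def precision_at_each_rank(scored_labels):
    """Given (score, is_true) pairs for held-out genes, return a list of
    (score, precision) where precision is the fraction of correct
    predictions among all pairs scoring at least that high.
    Assumes scores are distinct (ties are not handled specially)."""
    ranked = sorted(scored_labels, key=lambda pair: pair[0], reverse=True)
    result = []
    hits = 0
    for rank, (score, is_true) in enumerate(ranked, start=1):
        hits += int(is_true)
        result.append((score, hits / rank))
    return result


def lowest_score_meeting(scored_labels, min_precision):
    """Return the lowest score cutoff at which everything at or above it
    still has precision >= min_precision, or None if no cutoff does."""
    threshold = None
    for score, precision in precision_at_each_rank(scored_labels):
        if precision >= min_precision:
            threshold = score
    return threshold


# Hypothetical held-out predictions: (prediction score, phenotype confirmed?)
held_out = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.20, False),
]

cutoff = lowest_score_meeting(held_out, min_precision=0.95)
high_confidence = [score for score, _ in held_out if score >= cutoff]
```

Predictions for unannotated genes scoring at or above `cutoff` would then be treated as ‘high-confidence’. A stricter `min_precision` keeps fewer predictions, which is the trade-off behind choosing a cutoff such as 95%.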

Looking over the results, it was instantly clear that test genes were substantially more likely to cause cardiac defects than controls. However, as with any morpholino-based experiment, potential off-target effects were a real concern. After substantial re-screening, we confidently identified 11 genes as causing a cardiac defect upon knockdown. Among these were hspb7, which had been implicated in human heart failure through an unknown mechanism, and tmem88a, which encodes a Wnt-interacting protein but had no known phenotype at the time of submission (several concurrent publications have since confirmed the importance of both of these genes during cardiac development).

During publication, we strove to make our prediction results as easily available as possible (in addition to the supplement, you can find the predictions at www.genemania.org and http://zfunc.mshri.on.ca). While we hope the zebrafish community finds these predictions helpful, we believe there may be a larger context for these results. Although we focused on morpholino-mediated phenotypes, our analysis showed that our predictions were just as effective at identifying mutant effects. Additionally, zebrafish have relatively little gene feature information compared to mammalian model organisms, so this general strategy should be even more effective at directing experimentation in mammalian models. With large-scale phenotype-quantification efforts underway in multiple model organisms, ‘data overload’ presents a tremendous opportunity to increase the pace of gene function discovery.

Categories: Research, Resources