The Mouse ENCODE Project released a slew of papers late last month reporting findings from a three-year effort to comprehensively map functional elements in the mouse genome. Their major findings are summarized in an integrative paper in Nature (Yue, F. et al., 2014). Similar to the goals of the human ENCODE project (The ENCODE Project Consortium, 2012 and the ENCODE portal), Mouse ENCODE aimed to identify biochemically functional elements in order to better understand and design genomic and genetic studies in this important model organism (Mouse ENCODE Consortium et al., 2012).
This isn’t the ENCODE groups’ first model organism: if you missed the fly and worm papers, they came out in January 2011 (The modENCODE Consortium et al., 2012; Gerstein, M.B. et al., 2012). The modENCODE Project had the similar goal of elucidating functional elements in these two important genetic systems. Mouse ENCODE does the same thing, but with the advantage of tissue and cell types that are much more closely matched to human, with an eye to making mouse models of disease more genomically interpretable.
The human effort selected cell lines and primary tissues to create the most versatile data resource: one with many experiments in a few select cell types, and a few very informative experiments across many (hundreds of) cell types. The Mouse ENCODE Project organized itself similarly, but can also take advantage of analyzing tissues over a range of developmental time points, an axis more difficult to study in human. In total the group generated over 1000 datasets spanning cells, tissues, and time points that go into the integrative analysis paper (Supplementary Table 1). Companion papers break down subsets of the data into more concentrated efforts to understand domains of replication timing (Pope, B.D. et al., 2014), or regulatory and transcriptional conservation and divergence between human and mouse (Cheng, Y. et al., 2014; Sundaram, V. et al., 2014; Stergachis, A.B. et al., 2014; Lin, S. et al., 2014).
Main points of paper
There have been many good overviews of the Mouse ENCODE main integrative and companion papers since their publication in the Nov 20 issue of Nature (Carninci, P., 2014), and the authors of the integrative paper offer a bullet point summary of findings after the introduction. Here I will go into more detail on two sections of the paper: conservation of expression patterns and the conservation of cis-regulatory elements. My purpose is to break down the high-level descriptions of methods that don’t get much space in the main text of the paper, and to make both the methods and the data in this seminal paper more accessible.
The field has been split on whether expression varies mainly with tissue or is more closely tied to species, and there are data on both sides. Some previous studies have found that expression is more tightly correlated with species; others find that gene expression profiles of the same cell type in different species are more similar than those of different cell types in the same species. When Mouse ENCODE looked at expression of genes across many tissues and compared them to human, they found both are correct: there is evidence that gene expression is mostly conserved, but there is biological pathway-specific divergence. The principal component analysis in Fig 2a shows that overall gene expression tends to vary with tissue type, but that there is some variation due to species (principal component 2). Fig 2b pinpoints which genes vary more with species or with cell type. The importance of this analysis is that it reveals which genes or sets of (orthologous) genes are more species-specific, and therefore may not be the best markers for disease model studies.
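To make the tissue-versus-species question concrete, here is a minimal sketch of the comparison at its heart. It is not the paper's principal component analysis; it swaps in a simpler pairwise Pearson correlation between expression profiles. The sample names and every number in `profiles` are invented toy values, with each position in a list standing for one hypothetical orthologous gene.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy expression values for five hypothetical orthologous genes (assumed data).
profiles = {
    "human_liver": [10, 9, 1, 2, 5],
    "mouse_liver": [9, 10, 2, 1, 6],
    "human_brain": [1, 2, 10, 9, 5],
    "mouse_brain": [2, 1, 9, 10, 6],
}

# Same tissue, different species.
same_tissue = pearson(profiles["human_liver"], profiles["mouse_liver"])
# Same species, different tissue.
same_species = pearson(profiles["human_liver"], profiles["human_brain"])

print(f"liver, human vs mouse: r = {same_tissue:.3f}")
print(f"human, liver vs brain: r = {same_species:.3f}")
```

In this toy data the tissue signal dominates by construction. With real data the answer can differ gene by gene, which is the point of an analysis like Fig 2b.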
One major question in comparative genomics is whether gene expression is conserved across species. The analysis so far suggests that it is, at least for some orthologs. To quantify this, the authors determine the conservation of co-expression using a novel method they call Neighborhood Analysis of Conserved Co-expression (NACC). The idea is to determine, for a pair of orthologous genes, whether they have a similar correlation of expression with a set of “neighborhood genes” in each species. For a test gene X in human, the NACC method determines the Euclidean distance of the test gene to a set of “neighborhood genes” (also with orthologs in mouse). Then they quantify the same distance metric between the ortholog of test gene X in mouse and the orthologs of the neighborhood genes. Repeating the analysis in the inverse direction (starting from mouse) gives them two sets of Euclidean distances for each orthologous gene. The average change in both directions is then a “symmetric measure of the degree of conservation of co-expression for each gene.” Fig 2c shows that orthologs tend to have more similar distances to their set of orthologous neighbor genes, quantified by a small deltaD (on the x-axis), compared to test genes that are randomly paired (not based on orthology). This suggests that most orthologous genes have conserved co-expression between species.
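The NACC idea described above can be sketched in a few lines. This is my reading of the method under stated assumptions, not the authors' implementation: the neighborhood is taken to be the `k` nearest genes by Euclidean distance, the per-direction change is the difference in mean neighbor distance, and both species are stored as dicts keyed by a shared ortholog ID. The function names, `k`, and the toy expression values are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(expr, gene, k):
    """The k genes closest to `gene` in one species' expression space."""
    others = [g for g in expr if g != gene]
    others.sort(key=lambda g: euclidean(expr[gene], expr[g]))
    return others[:k]

def mean_distance(expr, gene, neighbors):
    """Average distance from `gene` to a fixed set of neighbor genes."""
    return sum(euclidean(expr[gene], expr[n]) for n in neighbors) / len(neighbors)

def one_way_delta(expr_a, expr_b, gene, k):
    """Pick neighbors in species A, then re-measure the distances in species B."""
    neighbors = nearest_neighbors(expr_a, gene, k)
    return mean_distance(expr_b, gene, neighbors) - mean_distance(expr_a, gene, neighbors)

def nacc(expr_human, expr_mouse, gene, k=2):
    """Symmetric score: average of the human-to-mouse and mouse-to-human changes."""
    forward = one_way_delta(expr_human, expr_mouse, gene, k)
    reverse = one_way_delta(expr_mouse, expr_human, gene, k)
    return 0.5 * (forward + reverse)

# Toy data: keys are shared ortholog IDs, values are expression across 3 samples.
human = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.0, 2.0, 4.0],
    "g3": [1.0, 3.0, 3.0],
    "g4": [9.0, 9.0, 9.0],
}
mouse_same = {gene: list(values) for gene, values in human.items()}
mouse_diverged = dict(mouse_same, g1=[9.0, 9.0, 8.0])  # g1 moves next to g4

print(nacc(human, mouse_same, "g1"))      # conserved neighborhood
print(nacc(human, mouse_diverged, "g1"))  # neighborhood has changed
```

A score near zero means the gene keeps the same expression neighborhood in both species; a large score means the neighborhood has shifted. To mirror Fig 2c, you would compare the scores of true orthologs with scores from randomly paired genes.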
The authors can extend their NACC method to quantify correlation of any orthologous regions, for example, regulatory elements. They do this for H3K27ac peaks and DNaseI signal (Fig 4) and find that “most sequence alignable regulatory elements are conserved in activity”.
Cis-regulatory conservation and divergence
By sequence homology of regulatory elements (defined by chromatin mark data), Mouse ENCODE determines that between one-half and two-thirds of regulatory elements are conserved between human and mouse. Next they ask if divergent (species-specific) regulatory elements are enriched near genes for certain biological processes. This enrichment analysis reveals that mouse-specific regulatory elements tend to be located near genes for immune processes (Fig. 3c). This agrees with the conservation of co-expression analysis, which found that immune genes tend to have lower conservation of co-expression (Fig 2d). Together these data suggest that immune function in mouse is regulated distinctly from human.
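For readers who have not run an enrichment analysis, here is a minimal sketch of one generic way to ask whether a gene set is over-represented near a class of elements. It uses a hypergeometric tail test, a common choice; I am not claiming it is the test or tool the authors used. The function name and the toy counts are hypothetical.

```python
from math import comb

def hypergeom_tail(total_genes, annotated_genes, selected_genes, selected_annotated):
    """P(X >= selected_annotated) when `selected_genes` genes are drawn without
    replacement from `total_genes`, of which `annotated_genes` carry the
    annotation of interest (for example, an immune-process term)."""
    upper = min(annotated_genes, selected_genes)
    favorable = sum(
        comb(annotated_genes, i)
        * comb(total_genes - annotated_genes, selected_genes - i)
        for i in range(selected_annotated, upper + 1)
    )
    return favorable / comb(total_genes, selected_genes)

# Toy numbers (assumed): 20 genes in total, 5 annotated as immune,
# 4 genes sit near mouse-specific elements, and 3 of those 4 are immune genes.
p_value = hypergeom_tail(20, 5, 4, 3)
print(f"enrichment p-value: {p_value:.4f}")
```

A small p-value says that seeing this many annotated genes near the elements would be unlikely if the elements were placed without regard to gene function.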
If some processes, such as immune function, are regulated distinctly, how are those species-specific elements generated? The authors explore how elements may have been added to or removed from the mouse genome by examining the overlap of enhancer elements with repetitive regions. They find that 85% of mouse-specific enhancers overlap with a repeat element, more than expected by chance. Delving further, this enrichment is pronounced in specific subfamilies of mobile elements. Specifically, mouse-specific transcription factor binding sites are enriched in short interspersed elements (SINEs) and long terminal repeats (LTRs).
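What "more than expected by chance" means for interval overlaps can be sketched with a simple permutation test: measure the fraction of enhancers that overlap a repeat, then compare it with the fraction obtained when the enhancers are placed at random. The post does not describe the authors' null model, so the uniform random placement on a single chromosome used here is an assumption, as are the function names and the toy coordinates.

```python
import random

def overlaps(a, b):
    """Half-open intervals (start, end) overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def fraction_overlapping(elements, repeats):
    """Fraction of elements that overlap at least one repeat."""
    hits = sum(1 for e in elements if any(overlaps(e, r) for r in repeats))
    return hits / len(elements)

def permutation_p_value(elements, repeats, chrom_length, n_perm=1000, seed=0):
    """Empirical p-value for the observed overlap fraction under random placement."""
    rng = random.Random(seed)
    observed = fraction_overlapping(elements, repeats)
    as_extreme = 0
    for _ in range(n_perm):
        shuffled = []
        for start, end in elements:
            length = end - start
            # Keep each element's length, but move it to a random position.
            new_start = rng.randrange(0, chrom_length - length + 1)
            shuffled.append((new_start, new_start + length))
        if fraction_overlapping(shuffled, repeats) >= observed:
            as_extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)

# Toy coordinates on one hypothetical 10 kb chromosome (assumed data).
repeats = [(100, 400), (1000, 1300), (5000, 5200)]
enhancers = [(150, 250), (1100, 1200), (5100, 5180), (8000, 8100)]

observed, p_value = permutation_p_value(enhancers, repeats, chrom_length=10_000)
print(f"observed overlap fraction: {observed:.2f}")
print(f"empirical p-value: {p_value:.4f}")
```

A real analysis would also need to respect chromosome boundaries, unmappable regions, and local sequence composition when shuffling; this sketch ignores all of that.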
The contribution of mobile elements to gene regulatory networks has been a hypothesis in the field of gene regulation since Barbara McClintock (McClintock, B., 1956). The hypothesis has gained more evidence recently (see: Wang, T. et al., 2007; Bourque, G. et al., 2008; and Jacques, P. et al., 2013). However, large-scale analysis of this hypothesis has only been enabled by advances in longer-read sequencing technologies that can resolve repetitive regions and by the abundance of data resources like the ENCODE Projects. The specific transcription factor enrichments across mobile elements in mouse are explored in more depth in Sundaram, V. et al., 2014.
Usefulness to developmental biology
The Mouse ENCODE goal of breaking down the genome into its functional parts should be very informative for the developmental biologist who wants to understand the regulatory structure of her favorite gene. The breadth and depth of sample types and assays means a biologist can visualize the chromatin structure of her gene(s) of interest in a relevant cell type or tissue.
To do this, of course, you’ll need to access the data. You can find it all through genome browser portals on the Mouse ENCODE website (mouseencode.org). You can also find the data hosted at the Wash U Epigenome Browser at epigenomebrowser.wustl.edu/browser.
Another aspect of the Mouse ENCODE work (and ENCODE in general) I’ve been thinking about is how these projects are improving our ability to predict functional elements with less information. In my mind this means three questions:
- What are the criteria that tell us region a is X type of functional element?
- How do we know if we have found all (or some %) of X functional elements?
- What is the most informative piece of data that can tell us if region b is X element vs Y element?
Most of these questions were addressed to a large extent in the ENCODE integrative paper published in 2012. In model organisms, though, the more important question is how the identified functional elements relate to human and how they affect the way we think about our models of human development and disease. A companion paper to the Mouse integrative paper starts to address this question (Cheng, Y. et al., 2014). I encourage you to read it, but I can’t stop thinking about Figure 3c: images of mouse embryos expressing a reporter construct of an identified functional element. The punchline is that half of the reporters are tissue-specific and half are not, and the authors interpret the latter as “pleiotropic”. What remains unanswered is which defining feature determines tissue-specificity vs. pleiotropy. That is what the developmental biologist wants to know, so she can prioritize candidate regulatory elements and understand the molecular genetics of her developmental or disease process more precisely.
More companion papers