Special Report: The GO Reference Genome Project - A Unified Framework for Functional Annotation across Species

The Reference Genome Project of the GO Consortium aims to comprehensively annotate all the gene products of human, as well as those of eleven important model organisms: Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli, Gallus gallus, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. These species are used to model various, complementary aspects of biology. Importantly, each of these organisms is supported by a manually annotated database with local expertise in GO annotation. Collectively, these twelve species are referred to as the "GO Reference Genomes". We welcome contributions from other groups able to provide GO annotations that conform to the rigorous standards set by the participating groups of the Reference Genome project.

[Figure: Partial graph of gene POLA]

By 'comprehensive annotation' we mean annotating all genes of each genome to the deepest level of detail supported by experimental results. To achieve this ambitious goal, we have developed an approach that superposes experimentally based annotations onto the leaves of phylogenetic trees. We then assign functions to the common ancestors, on the assumption that functions held in common at the leaves were present in those ancestors and are likely to be conserved in all other descendants within each family. We are using trees generated by the PANTHER project, based on standardized protein-coding gene sets from the twelve genomes as well as protein sequences from 34 other species, to provide a more complete phylogenetic spectrum. The quality of the trees was assessed by comparing them to “ortholog clusters” generated by the OrthoMCL algorithm for the same protein sets. The agreement was very good overall: of the 412 OrthoMCL clusters covering the comprehensively annotated Reference Genome genes, 387 (94%) were consistent with the trees.
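The inference step described above can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the project's actual curation procedure, which relies on curator judgment: here an ancestor simply receives the terms shared by every annotated leaf beneath it, and unannotated leaves inherit the terms of their nearest informative ancestor. The `Node` class, the function names, the gene names, and the term labels `T1` to `T3` are all hypothetical placeholders, not real PANTHER or GO data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    # Experimentally supported terms; None means "no experimental annotation".
    experimental: Optional[Set[str]] = None
    # Terms assigned by the inference passes below.
    inferred: Set[str] = field(default_factory=set)


def infer_up(node: Node) -> Optional[Set[str]]:
    """Post-order pass: give each ancestor the terms shared by all
    annotated leaves below it. Returns None if nothing below is annotated."""
    if not node.children:
        if node.experimental is None:
            return None
        node.inferred = set(node.experimental)
        return node.inferred
    child_sets = [s for s in (infer_up(c) for c in node.children) if s is not None]
    if not child_sets:
        return None
    node.inferred = set.intersection(*child_sets)
    return node.inferred


def propagate_down(node: Node, inherited=frozenset()) -> None:
    """Pre-order pass: unannotated leaves inherit the terms of the
    nearest ancestor that has inferred terms."""
    if not node.children:
        if node.experimental is None:
            node.inferred = set(inherited)
        return
    for child in node.children:
        propagate_down(child, node.inferred or inherited)


# Hypothetical four-gene family; labels are placeholders, not real annotations.
gene_a = Node("gene_a", experimental={"T1", "T2"})
gene_b = Node("gene_b", experimental={"T1", "T2", "T3"})
gene_c = Node("gene_c", experimental={"T1"})
gene_d = Node("gene_d")  # no experimental data
anc1 = Node("anc1", children=[gene_a, gene_b])
anc2 = Node("anc2", children=[gene_c, gene_d])
root = Node("root", children=[anc1, anc2])

infer_up(root)
propagate_down(root)
print(sorted(anc1.inferred), sorted(root.inferred), sorted(gene_d.inferred))
```

In this toy family, `anc1` receives the two terms its leaves share, the root keeps only the single term common to every annotated leaf, and the unannotated `gene_d` inherits that term from `anc2`.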

The prioritization of curation targets for the Reference Genome project is based on the following principles:

(i) Genes whose products are highly conserved during evolution, e.g., the gyrase/topoisomerase II gene family, conserved from bacteria to human.
(ii) Genes known to be implicated in human disease, e.g., the MutS homolog gene family, whose human ortholog is involved in a hereditary form of colorectal cancer.
(iii) Genes whose products are involved in biochemical and signaling pathways, e.g., the PYGB phosphorylase, which participates in glycogen degradation.
(iv) Genes identified from recently published literature as having an important or new scientific impact, e.g., POU5F1 (POU class 5 homeobox 1 gene), which is important for stem cell function.

This prioritization promotes the comprehensive annotation of genes of high relevance to current research efforts.

As of November 2008, we have annotated approximately 4,000 gene products. These genes have a higher percentage of annotations derived from published experimental research than other GO-annotated gene products. Initially, 34% of the 4,000 genes had annotations supported by experimental data; now 71% do, a 2-fold increase. By comparison, a randomly selected sample of the same number of genes from the rest of the GO annotation set shows only a 52% increase in experimental annotation, a 1.5-fold increase.
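The fold increases quoted above follow from simple arithmetic, sketched here in Python. The function names are illustrative only, and the sketch assumes that the 52% figure for the random sample is a relative increase rather than a final coverage percentage.

```python
def fold_change(before_pct: float, after_pct: float) -> float:
    """Fold change between two coverage percentages."""
    return after_pct / before_pct


def fold_from_relative_increase(increase_pct: float) -> float:
    """Convert a relative increase (e.g. 52%) into a fold change."""
    return 1 + increase_pct / 100


# Reference Genome genes: 34% -> 71% with experimental support.
reference_fold = fold_change(34, 71)

# Random sample: assumed to be a 52% relative increase.
random_fold = fold_from_relative_increase(52)

print(f"reference genes: {reference_fold:.2f}-fold")
print(f"random sample:   {random_fold:.2f}-fold")
```

Rounded to one significant decimal, these values correspond to the "2-fold" and "1.5-fold" figures in the text.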

The paper describing this project and the preliminary results is in press in PLoS Computational Biology.

by Pascale Gaudet (dictyBase, Northwestern University)