Various mechanisms (origination types), including duplication and de novo origination, create new genes, which in turn drive phenotypic evolution. However, these species- or lineage-specific genes tend to be narrowly transcribed and poorly conserved, and those created by segmental duplication or de novo origination usually encode short proteins. These features make new gene annotation unstable in mainstream practice (e.g. Ensembl). Specifically, of the 1,828 primate-specific coding genes (PSGs, hereafter new genes) annotated in Ensembl v51, 61% were reannotated as noncoding RNAs or pseudogenes, or were deleted, in Ensembl v73. By contrast, the biotype changed for only 8% of protein-coding genes predating the primate split. The annotation instability is particularly serious for de novo genes: only 8% (5/60) of previously identified candidates were kept as coding genes in Ensembl v73.
To tackle this problem, we developed GenTree, a web interface that helps infer gene age and refine gene models. On the one hand, we took advantage of the vertebrate whole-genome alignments provided by UCSC to infer the origination time of genes, and used self whole-genome alignment and all-against-all protein alignment to infer the origination mechanism. Published age datasets, including phylostratigraphy and ProteinHistorian, were cross-referenced. On the other hand, sequencing-based transcriptome data were imported to indicate the functional potential of candidate primate-specific genes, while mass spectrometry-based proteome information was integrated to help differentiate coding from noncoding genes. Evolutionary inference is also integrated to aid gene function annotation. Indeed, at least 40 pseudogenes annotated by Ensembl are protein-coding as supported by mass spectrometry, and 2 pseudogenes are supported by selection signals.
By integrating evolutionary methods and proteogenomics, we curated a dataset of 263 PSGs with coding evidence out of the 837 PSGs, which can be accessed through the PSG Gateway. GenTree currently contains only human data; future updates will cover model organisms such as mouse and fruit fly.
Dating Strategy
We adopted a genome-alignment-based pipeline to infer the origination time of a given genomic region, as described in Zhang E. Y., Long M. et al. (2010). We analyzed the UCSC netted chain files for human to verify whether a given human gene contains a locus that has a reciprocal syntenic alignment in 22 outgroup genomes, including chimpanzee, mouse, chicken and so on. In other words, we investigated whether a best-to-best match could be found between human loci and outgroup loci regardless of chromosomal linkage. In this way, orthologous loci can be identified even when their chromosomal locations differ because of fusions or translocations. To handle occasional sequencing gaps, we scanned multiple outgroups for Clupeocephala, Sauria, Laurasiatheria and Glires. We assigned each locus to a specific branch by following a modified Dollo parsimony rule. Using Ensembl release 73 as the base gene dataset, we dated each gene according to the oldest locus within it.
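The dating logic can be illustrated with a small Python sketch. It implements a plain Dollo-style rule (a locus is as old as the most distant outgroup in which a reciprocal syntenic alignment is found), not the modified rule used in the actual pipeline. The function names and the outgroup-to-branch mapping in the usage note are illustrative assumptions; only the conventions that branch 13 is the youngest branch and that a chimpanzee-only alignment gives branch 12 come from this page.

```python
# Hypothetical sketch of Dollo-style dating; not the GenTree pipeline itself.
YOUNGEST_BRANCH = 13  # branches run from 0 (oldest) to 13 (youngest)

def date_locus(syntenic_outgroups, outgroup_branch):
    """Return the branch of a locus.

    syntenic_outgroups: outgroup species with a reciprocal syntenic alignment.
    outgroup_branch: maps each outgroup to the branch assigned when that
        outgroup is the most distant one sharing the locus (assumed mapping).
    """
    branches = [outgroup_branch[species] for species in syntenic_outgroups]
    # No outgroup shares the locus: assign the youngest branch.
    return min(branches, default=YOUNGEST_BRANCH)

def date_gene(loci, outgroup_branch):
    """Date a gene by its oldest locus, i.e. the lowest branch number."""
    return min(date_locus(locus, outgroup_branch) for locus in loci)
```

For example, with a toy mapping such as `{"chimpanzee": 12, "mouse": 5}` (the mouse value is made up for illustration), a gene with one locus shared only with chimpanzee and another locus found in no outgroup is dated to branch 12, because the oldest locus decides.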
Origination Mechanism Identification
As with the dating strategy, we identified the origination type following Zhang E. Y., Long M. et al. (2010). New genes are classified as DNA-level duplicates, RNA-level duplicates (retrogenes) or de novo genes. Briefly, we performed an all-against-all BLASTP search of human proteins. It was reported previously that retrogenes can recruit neighboring genomic regions with introns after being retroposed. Thus, to define a new gene as a retrogene, we required that, within the region aligned between the most similar paralog (the candidate parental gene) and the child gene, the former contains at least one intron and the latter is intronless. Otherwise, the gene is classified as a DNA-level duplicate. Notably, if no hit is found at a BLAST E-value cutoff of 10^-6 or with an aligned fraction greater than 0.7 and identity greater than 0.5, and Ensembl annotates no paralog, the gene is defined as de novo.
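The decision rules above can be summarized in a short Python sketch. This is an illustration under stated assumptions, not the pipeline's code: the `BestHit` record, its field names and the `classify` function are hypothetical. A hit is treated as homologous if it passes either the E-value cutoff or the aligned-fraction and identity thresholds, which is one reading of the criteria. The "unclassified" outcome for genes with an Ensembl paralog but no qualifying hit is also an assumption, since that case is not described here.

```python
from typing import NamedTuple, Optional

class BestHit(NamedTuple):
    """Most similar paralog of a gene (hypothetical record layout)."""
    evalue: float
    aligned_fraction: float
    identity: float
    parent_introns_in_alignment: int  # introns of the paralog in the aligned region
    child_introns_in_alignment: int   # introns of the new gene in the aligned region

def is_homolog(hit: BestHit) -> bool:
    # Assumed reading: either criterion is enough to call a homolog.
    return hit.evalue <= 1e-6 or (hit.aligned_fraction > 0.7 and hit.identity > 0.5)

def classify(hit: Optional[BestHit], has_ensembl_paralog: bool) -> str:
    if hit is None or not is_homolog(hit):
        # De novo requires both no qualifying hit and no annotated paralog.
        return "unclassified" if has_ensembl_paralog else "de novo"
    if hit.parent_introns_in_alignment >= 1 and hit.child_introns_in_alignment == 0:
        return "RNA-level duplicate (retrogene)"
    return "DNA-level duplicate"
```

A gene whose best paralog has 11 introns in the aligned region while the gene itself has none would come out as a retrogene under this sketch.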
User Guide:
Browse
Users may access genes through the Browse page. The grey bars on the x-axis indicate the chromosomes. The Y chromosome and the mitochondrion are not included because no genes on them could be dated with confidence. Scaffolds that failed to be assembled into the main chromosomes are excluded for the same reason. The legend at the bottom can be used to set the visibility of genes on different branches. Each dot represents a gene and links to the page that shows the evolutionary and functional details of that gene. Dots are drawn on the left or right side of the chromosome to denote the strand of the gene, and are clickable. With the mouse over a chromosome, a summary of the genes on that chromosome is shown in a tooltip. The pie chart at the top right shows the proportion of currently visible genes of each age/branch, and also provides tooltips. Note that, for loading speed, the default page shows only PSGs (see figure below). The non-PSGs can be displayed by clicking the legend at the bottom. Users can also search for genes with a certain GO term or InterPro domain.
Text Search
Users may find genes through Text Search. A correct and complete Ensembl ID is preferred, for example 'ENSG00000007129'. Gene symbols approved by HGNC are also supported. Boolean mode is helpful when searching against the gene descriptions provided by Ensembl. For example, searching '+insulin -receptor' returns genes whose description contains 'insulin' but excludes those containing 'receptor'. To search for ZNF genes, type 'ZNF*'. For more details, see the boolean mode documentation.
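The meaning of the boolean operators can be illustrated with a toy matcher in Python. This is only a sketch of the semantics described above ('+' requires a word, '-' excludes it, a trailing '*' matches a prefix); it is not the search backend used by the site, and the function name `matches` and the treatment of unsigned words as optional are assumptions.

```python
import re

def matches(query: str, text: str) -> bool:
    """Toy boolean-mode match of a query against a description or symbol."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))

    def found(term: str) -> bool:
        if term.endswith("*"):  # trailing '*' is a prefix wildcard
            prefix = term[:-1]
            return any(word.startswith(prefix) for word in words)
        return term in words

    required_seen = optional_seen = optional_hit = False
    for token in query.lower().split():
        sign = token[0] if token[0] in "+-" else ""
        term = token[len(sign):]
        if not term:
            continue
        hit = found(term)
        if sign == "+":
            required_seen = True
            if not hit:
                return False  # a required word is missing
        elif sign == "-":
            if hit:
                return False  # an excluded word is present
        else:
            optional_seen = True
            optional_hit = optional_hit or hit
    # With only unsigned words, at least one of them must be present.
    if optional_seen and not required_seen:
        return optional_hit
    return True
```

Under this sketch, `matches('+insulin -receptor', 'insulin receptor substrate 1')` is rejected because the excluded word is present, while a description mentioning insulin without 'receptor' is accepted.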
Location-based Search
Users may specify a genomic region of interest to search for genes via Location-based Search. The UCSC-like genomic region format is preferred, for example 'chr1:1-100000'. Another format, which gives the start position and the length of the region, is also allowed. For example, 'chr1:1+99999' is equivalent to 'chr1:1-100000'.
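The two accepted formats can be reconciled with a small parser. The Python sketch below is a minimal illustration rather than GenTree's actual code; the function name `parse_region` and the regular expression are assumptions. It treats the number after '+' as an offset added to the start, which is what the stated equivalence of 'chr1:1+99999' and 'chr1:1-100000' implies.

```python
import re

# chromosome, start, separator ('-' for end, '+' for offset), second number
_REGION = re.compile(r"^(chr\w+):(\d+)([-+])(\d+)$")

def parse_region(text: str):
    """Parse 'chr:start-end' or 'chr:start+offset' into (chrom, start, end)."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    value = int(match.group(4))
    # '-' gives the end coordinate directly; '+' gives an offset from start.
    end = value if match.group(3) == "-" else start + value
    if end < start:
        raise ValueError(f"end precedes start in region: {text!r}")
    return chrom, start, end
```

Both example strings from the text parse to the same tuple, `('chr1', 1, 100000)`.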
Sequence-based Search
Users with a nucleotide or amino acid sequence may use Sequence-based Search. Human protein and transcript databases are provided. Note that choosing an improper method will cause the query to fail. The color keys and the hyperlinks in the 'score' field link to the alignment details. The hyperlinks in the 'Hit Sequence' field link to the evolutionary and functional details of the gene for each hit. Sequences are accepted either in FASTA format or as bare sequence.
PSG Gateway
As mentioned before, we curated a dataset of 263 PSGs with coding evidence out of the 837 PSGs by integrating evolutionary methods and proteogenomics. For coding genes, we examined whether they are supported by UniProt annotation ("evidence at the protein level"), GO annotation (GO terms with experimental evidence codes, including TAS, IPI, IDA, EXP, IMP, IC, IEP and IGI), proteogenomics (at least two uniquely mapping spectra) and various Ka/Ks tests (significant or not). Notably, the Ka/Ks tests based on orthologous alignment were performed only for genes assigned to branches 0-9.
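The four lines of evidence can be tabulated per gene as in the Python sketch below. It only records which lines of evidence are present; how they are combined into the final curated set is not specified here, so no overall verdict is computed. The function name, argument names and dictionary keys are hypothetical.

```python
# GO evidence codes listed in the text as corresponding to experimental evidence.
EXPERIMENTAL_GO_CODES = {"TAS", "IPI", "IDA", "EXP", "IMP", "IC", "IEP", "IGI"}

def coding_evidence(uniprot_protein_level, go_evidence_codes,
                    unique_spectra, kaks_significant):
    """Summarize which lines of coding evidence support a gene.

    uniprot_protein_level: True if UniProt reports evidence at the protein level.
    go_evidence_codes: iterable of GO evidence codes annotated to the gene.
    unique_spectra: number of uniquely mapping mass spectra.
    kaks_significant: True if any Ka/Ks test is significant.
    """
    return {
        "uniprot": bool(uniprot_protein_level),
        "go": any(code in EXPERIMENTAL_GO_CODES for code in go_evidence_codes),
        "proteogenomics": unique_spectra >= 2,  # at least two unique spectra
        "kaks": bool(kaks_significant),
    }
```

For instance, a gene with GO codes IEA and IDA and two uniquely mapping spectra would be flagged for the GO and proteogenomics lines of evidence, because IDA is in the listed set and the spectrum count reaches the threshold.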
Description:
Branch & Branch View
There are 14 branches (0-13) in the current age dating system. The younger a gene is, the higher the branch number it is assigned. As shown in the figure, these branches are ladderized on the left. Moving the mouse over a branch line triggers a tooltip that gives the taxon name and the estimated time from TimeTree. The gene originated in the common ancestor on the branch marked by the red node (for this gene, branch 12 means that the gene gain event occurred before the split of human and chimpanzee). A black species line denotes that a reciprocal alignment was found in that species, while a grey line denotes a nonsyntenic alignment (for this gene, a reciprocal alignment was found only in chimpanzee, and nonsyntenic alignments were found in the rest). If you put your mouse over a species line, a tooltip presents more detailed information, as in the figure below.
Dubious
Sometimes the branch is colored red and followed by the notation '(Dubious)'. 'Dubious' denotes that the age dating of this gene is unreliable in some respect. Possible reasons include: the gene is located on an unassembled scaffold or in a repeat region, which makes the alignment unreliable; the gene is not annotated as a protein-coding gene; or the phylogenetic alignment shows an odd pattern. Genes located on chrM and chrY are also marked 'Dubious' because the evolutionary trajectories of genes on these chromosomes are complex, so results derived from an automatic pipeline should be treated with special caution. For example, SRGAP2C is annotated as a pseudogene by Ensembl release 73, despite the report by Dennis M. Y., Eichler E. E. et al. (2012).
Origination Type
As stated previously, the origination type is based on an all-against-all BLASTP search of human proteins. A relatively loose criterion (though still strict in absolute terms) is adopted for identifying homologues with high sensitivity, which in turn decreases the false positive rate for de novo genes. A much stricter criterion is adopted for classifying DNA-level and RNA-level duplicates, in order to reduce false positives; however, it may introduce false negatives. There are many genes whose homology is hard to decide. Hence we classified the genes that fall into this gray zone as 'DNA-level duplication likely' or 'RNA-level duplication likely'. Because the pipeline admits protein-coding genes with only moderate similarity, special caution is needed when the origination mechanism includes the label 'likely'. For example: ENSG00000007129.
Paralogous Alignment
The Paralogous Alignment panel shows how the origination type was identified by displaying the alignment. Two gene models are displayed in the graph: the upper one, in blue, is the focal gene (the gene of interest), and the lower one, in black, is the paralogous gene. The scale is the same for both genes. Rectangles are exons, and introns are shown as the spaces between them. Blocks filled with color are CDS and white blocks are UTRs, with the 5' UTR on the left and the 3' UTR on the right. Because the origination mechanism is based on an all-against-all BLASTP search of human proteins, the alignment covers only the green rectangles. The alignment is shown by the green blocks in the middle, which represent the protein. If you put your mouse over the green blocks, the detailed alignment is presented, with red letters denoting mismatches and gaps. Tooltips are also available for the gene model blocks. The polygons show how the protein sequence corresponds to the genomic sequence. The figure can be zoomed in by selecting a region. The Paralogous Alignment panel thus offers a more visual way to infer the origination type. For example, for ENSG00000182890, as shown in the figure, the coding region of the paralogous gene aligns almost perfectly to the focal gene, which suggests that the homology inference is highly reliable. As the paralogous gene contains 11 introns whereas the focal gene has none, the focal gene is most likely an RNA-level duplicate.
Transcriptome Profile
The transcriptome profile is presented as evidence of the expression of the gene. The current dataset is from The Human Protein Atlas. Processed with Trimmomatic, Bowtie 2 and RSEM, the transcriptome profile directly suggests in which tissues the gene of interest is abundantly expressed.
GenTree also shows the pre-calculated RPKM values from BrainSpan. We summarized the data by tissue, stage and gender to provide a better visualization. Users can easily browse the transcriptome profile in the brain and switch view modes with the button.
Proteomic Profile
The Proteomic Profile provides evidence of translation potential. Because each peptide is translated and degraded at different levels in each sample, the numbers and colors in the heatmap denote its abundance. For a clean visualization, the page presents at most 5 uniquely mapped peptides detected in at most 10 tissues. The most enriched tissues are shown, with the most abundant on the left. Likewise, the most abundantly detected peptides are shown, with the most abundant at the top. Users can click the hyperlinks in the subtitle for more details: the tissue summary, the peptide summary and all entries.
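The selection of rows and columns for the heatmap can be sketched in Python as follows. This is a hypothetical illustration: the function name, the input layout and the use of summed abundance to rank peptides and tissues are all assumptions, since the aggregation used by the site is not described here.

```python
from collections import defaultdict

def heatmap_axes(abundance, max_peptides=5, max_tissues=10):
    """Pick the peptides (rows) and tissues (columns) to display.

    abundance: dict mapping (peptide, tissue) to an abundance value.
    Returns (peptides, tissues), each sorted from most to least abundant,
    so the first peptide goes on top and the first tissue on the left.
    """
    peptide_total = defaultdict(float)
    tissue_total = defaultdict(float)
    for (peptide, tissue), value in abundance.items():
        peptide_total[peptide] += value
        tissue_total[tissue] += value
    peptides = sorted(peptide_total, key=peptide_total.get, reverse=True)
    tissues = sorted(tissue_total, key=tissue_total.get, reverse=True)
    return peptides[:max_peptides], tissues[:max_tissues]
```

Passing smaller limits, for example `max_peptides=1`, keeps only the single most abundant peptide.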
PhyloStratiGraphy
Domazet-Loso T., Tautz D. et al. (2007) coined the term phylostratigraphy and provided a phylostratigraphic pipeline to date the emergence of genes and gene families. Their method is based on BLASTP searches of the proteins of a focal species against the National Center for Biotechnology Information (NCBI) non-redundant (NR) database. This protein-alignment-based dating strategy is more suitable than genome alignment for deeply diverged species, because protein sequences are more conserved. Although it differs widely from our pipeline, phylostratigraphy represents the efforts of other teams to explore evolutionary innovations and is widely cited. The phylostratigraphy of human (Domazet-Loso T., Tautz D. et al. 2008) is included here for support and comparison, so that users may better understand the evolutionary trajectory of a gene. The figure below shows an example of an old gene that emerged before the split of the common ancestor of cellular organisms.
Protein Historian
Protein Historian is another reference we include to better illustrate the evolution of a gene. As its authors state, Protein Historian provides pre-computed ages for many species based on several external databases of protein families and two ancestral family reconstruction algorithms. There are five datasets for human: PPODv4_Jaccard_families, PPODv4_MultiParanoid_families, PPODv4_Nens_families and PPODv4_OrthoMCL_families cover proteins in only 12 species, while PPODv4_PTHR7-OrthoMCL_families covers proteins in 48 species (to be included later). The line width is proportional to the number of members in the gene family, as shown in the figure below (the tooltip of a species line provides more details for the gene). The two ancestral family reconstruction algorithms are Dollo parsimony and Wagner parsimony. This gives 10 estimates for one gene (2 of them will be added later). It is common for age predictions based on distinct datasets or different algorithms to disagree. For example, ENSG00000007129 shows different age inferences for different datasets. For more details, please visit Protein Historian.
FAQ:
Why is a gene's paralogous gene also its focal gene?
As stated previously, origination type inference is based on the all-against-all BLASTP result, guided by the self whole-genome alignment. What is shown on the website and in the dataset is the most similar paralog found for each duplicated gene. It is normal for two genes to be most similar to each other. Users who would like to analyze duplicated genes with a clear parent-child relationship should take advantage of the age information. In other words, the parent-child relationship is unambiguous only when the parent gene has a more ancient origination than its child gene. In some cases, both genes originated on the same branch, and only an equivocal parent-child relationship can be provided.
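The age-based rule for resolving parent and child can be written as a short Python sketch. The function name and return convention are illustrative assumptions; the only facts used from this page are that a lower branch number means an older gene and that two genes on the same branch cannot be ordered.

```python
def parent_child(gene_a, branch_a, gene_b, branch_b):
    """Return (parent, child) for a pair of mutually most similar paralogs.

    A lower branch number means an older origination, so the gene on the
    lower-numbered branch is taken as the parent. Returns None when both
    genes fall on the same branch and the relationship is equivocal.
    """
    if branch_a < branch_b:
        return gene_a, gene_b
    if branch_b < branch_a:
        return gene_b, gene_a
    return None
```

For a pair of paralogs dated to branches 0 and 12, the branch 0 gene is reported as the parent; for a pair that both date to branch 12, the function returns None.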