PGDD is a public database to identify and catalog plant genes in terms of intragenome or cross-genome syntenic relationships. Current efforts focus on flowering plants with available whole-genome sequences (preferably assembled pseudomolecules with ordered gene models).

Data sources

Plant genomes in this database (33 genomes)
Species name | Common name | Release version | Gene number | Access | Reference
Actinidia chinensis | Kiwifruit | May 2013 | 32,670 | KGD | Nature Communications
Arabidopsis lyrata | Lyrate rockcress | Version 1.0 (Apr 2011) | 32,670 | JGI | Nature Genetics
Arabidopsis thaliana | Arabidopsis | TAIR 9.0 (Jun 2009) | 27,379 | TAIR | Nature
Amborella trichopoda | Amborella | Version 1.0 | 26,846 | Penn State University | Science
Brachypodium distachyon | Purple false brome | Phytozome v6.0 | 32,255 | JGI | Nature
Brassica rapa | Chinese cabbage | Version 1.1 | 22,285 | BRAD | Nature Genetics
Cicer arietinum | Chickpea | Jan 2013 | 28,269 | LIS | Nature Biotechnology
Cajanus cajan | Pigeonpea | Nov 2011 | 48,680 | IIPG | Nature Biotechnology
Carica papaya | Papaya | Dec 2007 | 25,536 | Hawaii | Nature
Chlamydomonas reinhardtii | Green algae | Version 4.2 | | JGI | Science
Cucumis sativus | Cucumber | Phytozome v6.0 | | | Nature Genetics
Eucalyptus grandis | Eucalyptus | Version 1.1 | | |
Fragaria vesca | Strawberry | Dec 2010 | | PFR | Nature Genetics
Glycine max | Soybean | 1.1 (Jun 2013) | 66,153 | JGI | Nature
Gossypium raimondii | Cotton | Version 2.1 | 37,505 | JGI | Nature
Lotus japonicus | Lotus | Release 2.5 | 42,399 | Kazusa | DNA Research
Musa acuminata | Banana | Jul 2012 | 36,542 | CIRAD | Nature
Malus x domestica | Apple | Aug 2010 | | IASMA | Nature Genetics
Medicago truncatula | Barrel medic | Mt3.5 v3 (Jun 2011) | | JCVI | Nature
Oryza sativa | Rice | Mar 2013 | 35,679 | RAP | Nature
Physcomitrella patens | Moss | Version 1.6 (Jan 2008) | 32,272 | JGI | Science
Prunus persica* | Peach | Version 1.0 | 27,864 | JGI | -
Populus trichocarpa | Western poplar | JGI 2.0 (Feb 2010) | 45,778 | JGI | Science
Phaseolus vulgaris | Common bean | Version 1.0 | 27,082 | JGI | Nature Genetics
Ricinus communis | Castor bean | Release 0.1 (May 2008) | 38,613 | JCVI | Nature Biotechnology
Sorghum bicolor | Sorghum | Sbi 1.4 (Dec 2007) | 34,496 | JGI | Nature
Solanum lycopersicum | Tomato | Version 2.3 | 34,727 | SGN | Nature
Selaginella moellendorffii | Selaginella | Version 1.0 (Dec 2007) | 22,273 | JGI | Science
Solanum tuberosum | Potato | Version 3.4 | 39,031 | PGSC | Nature
Theobroma cacao | Cacao | Release 0.9 (Sep 2010) | 28,798 | CIRAD | Nature Genetics
Utricularia gibba | Humped bladderwort | CoGe (Jun 2013) | 28,494 | CoGe | Nature
Vitis vinifera | Grape vine | Genoscope (Aug 2007) | 26,346 | Genoscope | Nature
Zea mays | Maize | Release 5a (Nov 2010) | 32,540 | AGI | Science


  • 06-25-2014  Common bean (Phaseolus vulgaris) was added to PGDD.
  • 06-18-2014  Eucalyptus (Eucalyptus grandis) was added to PGDD.
  • 11-07-2013  Kiwifruit (Actinidia chinensis) was added to PGDD.
  • 09-20-2013  Soybean (Glycine max) was updated to the new version (ver. 1.1) of the genome data.

The duplication history of plants in PGDD

*Branch lengths do not represent time or relative amount of character change.

References: Common tree taxonomy tool at NCBI, Document about plant paleopolyploidy at CoGe and Phylogenetic tree of species in Phytozome


Identify syntenic blocks

We used BLASTP to search for potential anchors (E < 1e-5, top 5 matches) between every possible pair of chromosomes in multiple genomes. The homologous pairs are used as the input for MCscan. MCscan is a novel synteny search program that combines the merits of two existing algorithms. The built-in scoring scheme for MCscan, similar to that of DAGchainer, is min{-log10 E, 40} for each matching gene pair and -1 for each 10 kb of distance between anchors. Blocks with scores > 200 were kept. The resulting syntenic chains were evaluated using a procedure from ColinearScan, with E-value < 1e-10 as the significance cutoff.
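The filtering and scoring rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MCscan's actual code: the function names, the tuple layouts, and the choice to measure the gap as the larger of the two chromosomal distances between consecutive anchors are all hypothetical. The sketch scores a chain that has already been ordered; it does not perform the chaining search itself.

```python
import math
from collections import defaultdict

E_CUTOFF = 1e-5          # BLASTP E-value threshold from the text
TOP_HITS = 5             # matches kept per query gene
SCORE_CAP = 40.0         # upper bound on -log10(E) per anchor
GAP_UNIT_BP = 10_000     # one penalty point per 10 kb between anchors
MIN_BLOCK_SCORE = 200.0  # blocks must score above this to be kept


def filter_hits(hits):
    """Keep hits with E < E_CUTOFF, at most TOP_HITS best per query.

    hits: iterable of (query, subject, evalue) tuples (assumed layout).
    """
    by_query = defaultdict(list)
    for query, subject, evalue in hits:
        if evalue < E_CUTOFF:
            by_query[query].append((query, subject, evalue))
    kept = []
    for rows in by_query.values():
        rows.sort(key=lambda row: row[2])  # smallest E-value first
        kept.extend(rows[:TOP_HITS])
    return kept


def anchor_score(evalue):
    """Score one anchor as min(-log10 E, 40)."""
    if evalue <= 0.0:  # BLAST reports 0.0 for very strong hits
        return SCORE_CAP
    return min(-math.log10(evalue), SCORE_CAP)


def chain_score(anchors):
    """Score an ordered chain of (pos_a, pos_b, evalue) anchors.

    Assumption: the gap between consecutive anchors is the larger of
    the two distances (in bp), penalized linearly at 1 per 10 kb.
    """
    score = 0.0
    for i, (pos_a, pos_b, evalue) in enumerate(anchors):
        score += anchor_score(evalue)
        if i > 0:
            prev_a, prev_b, _ = anchors[i - 1]
            gap = max(abs(pos_a - prev_a), abs(pos_b - prev_b))
            score -= gap / GAP_UNIT_BP
    return score


def is_kept(anchors):
    """Apply the block score threshold from the text."""
    return chain_score(anchors) > MIN_BLOCK_SCORE
```

Under these assumptions, six strong anchors spaced 10 kb apart score 6 × 40 − 5 × 1 = 235 and pass the threshold, while five such anchors score 196 and do not.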

Calculate synonymous substitutions (Ks)

For homologs inferred from syntenic alignments, we aligned the protein sequences of each gene pair using CLUSTALW and used the protein alignments to guide the CDS alignments with PAL2NAL. We then calculated Ks using the Nei-Gojobori method as implemented in the PAML package. An in-house Python script pipelines all the calculations. Log-Gaussian mixture models are fitted to the Ks distributions using GMM with Bayes factors.
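Two steps of this pipeline are simple enough to illustrate directly. The sketch below is not the in-house script and does not call CLUSTALW, PAL2NAL, or PAML. It shows (1) the idea behind protein-guided codon alignment, where each aligned residue is replaced by its codon and each gap by three gap characters, and (2) the Jukes-Cantor correction that the Nei-Gojobori method applies to the proportion of synonymous differences pS to obtain Ks = -(3/4) ln(1 - 4pS/3). The function names are hypothetical, and counting synonymous sites and differences to get pS is left to PAML.

```python
import math


def thread_codons(aligned_protein, cds):
    """Thread an ungapped CDS onto one row of a gapped protein alignment.

    Each residue consumes the next codon; each '-' becomes '---'.
    Any CDS left over after the last residue (for example a stop
    codon) is ignored.
    """
    n_residues = sum(1 for char in aligned_protein if char != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    codons = []
    pos = 0
    for char in aligned_protein:
        if char == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


def jukes_cantor(p):
    """Correct a proportion of differences p for multiple substitutions.

    Returns -(3/4) * ln(1 - 4p/3); undefined for p >= 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For example, `thread_codons("M-K", "ATGAAA")` gives `"ATG---AAA"`, which keeps the two codons in frame with the protein alignment.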

How can I calculate the significance of segmental duplication?

The calculation takes four inputs: the number of collinear genes, the total number of genes in the genome, the spread in location A, and the spread in location B.

How to cite