Datataxa: a new script to extract metadata sequence information from GenBank, the Flora of Bajío as a case study

  • Eduardo Ruiz-Sanchez Departamento de Botánica y Zoología, Centro Universitario de Ciencias Biológicas y Agropecuarias, Universidad de Guadalajara. Zapopan, Jalisco
  • Carlos Alonso Maya-Lastra Columbia University
  • Victor W. Steinmann Facultad de Ciencias Naturales, Universidad Autónoma de Querétaro, Campus Juriquilla, Juriquilla, Querétaro
  • Sergio Zamudio Investigador Independiente
  • Eleazar Carranza Instituto de Investigación de Zonas Desérticas, Universidad Autónoma de San Luis Potosí, San Luis Potosí
  • Rosa María Murillo Investigador Independiente
  • Jerzy Rzedowski Centro Regional del Bajío, Instituto de Ecología, A.C., Pátzcuaro, Michoacán
Keywords: API, checklist, Entrez, floristic treatment, Flora del Bajío y de Regiones Adyacentes, GenBank, vascular plants


Background: GenBank is a public repository that houses millions of nucleotide sequences. Several software tools have been developed to extract information stored in GenBank, but none of them extracts and organizes GenBank accessions based on their metadata. We developed a new script, Datataxa, to mine GenBank information, and applied it to the checklist of the Flora del Bajío y de Regiones Adyacentes (FBRA) as a case study.

Questions: How many species occurring in the FBRA have records in GenBank? What percentage of those records have been used for phylogenetic, phylogeographic, phylogenomic, barcoding, genetic diversity, and biogeographic studies?

Methods: Datataxa was written in the AutoIt scripting language to facilitate the extraction of information from GenBank. The extracted information was classified into six study categories. A checklist of the species in the published fascicles of the FBRA was used as a case study for the script, and the six categories were applied to the FBRA species list.
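As a rough illustration of the classification step, the sketch below assigns a publication title to zero or more of the six study categories by keyword matching. The keyword patterns, the function name `classify_title`, and the example titles are assumptions made for this example. The abstract does not state which terms Datataxa matches, and the sketch is written in Python rather than AutoIt.

```python
import re

# Hypothetical keyword patterns, one per study category named in the abstract.
# These are illustrative only; they are not taken from Datataxa itself.
CATEGORY_PATTERNS = {
    "phylogenetic": r"\bphylogen(y|ies|etics?)\b",
    "phylogeographic": r"\bphylogeograph",
    "phylogenomic": r"\bphylogenomic",
    "barcoding": r"\bbarcod",
    "genetic diversity": r"\bgenetic (diversity|variation|structure)\b",
    "biogeographic": r"\bbiogeograph",
}


def classify_title(title):
    """Return every category whose pattern occurs in the publication title."""
    lowered = title.lower()
    return [
        name
        for name, pattern in CATEGORY_PATTERNS.items()
        if re.search(pattern, lowered)
    ]


if __name__ == "__main__":
    # Invented titles, used only to exercise the function.
    for example in (
        "Phylogeny and biogeography of a Mexican plant genus",
        "DNA barcoding of vascular plants",
        "Direct Submission",
    ):
        print(example, "->", classify_title(example))
```

A title can match more than one pattern, so under this scheme a single species may be counted in several categories.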

Results: The script allowed us to search for metadata, such as publication titles, for the 2,558 species included in the FBRA. Of these, 1,575 had at least one record in GenBank. A total of 1,322 species were used in phylogenetic studies, followed by barcoding studies (326) and biogeographic studies (298). Phylogenomic (41), phylogeographic (34), and genetic diversity studies (34) were the least represented.
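The counts reported above can be turned into percentages with a few lines of arithmetic. The sketch below uses only the numbers in the Results paragraph. The choice of denominator (species with at least one GenBank record) for the per-category shares is an assumption, since the abstract does not say which base the authors used.

```python
# Counts copied from the Results paragraph.
SPECIES_IN_CHECKLIST = 2558
SPECIES_WITH_RECORDS = 1575
SPECIES_PER_CATEGORY = {
    "phylogenetic": 1322,
    "barcoding": 326,
    "biogeographic": 298,
    "phylogenomic": 41,
    "phylogeographic": 34,
    "genetic diversity": 34,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of checklist species with at least one GenBank record.
share_with_records = percent(SPECIES_WITH_RECORDS, SPECIES_IN_CHECKLIST)

# Assumed denominator: species that have at least one GenBank record.
share_per_category = {
    name: percent(count, SPECIES_WITH_RECORDS)
    for name, count in SPECIES_PER_CATEGORY.items()
}

if __name__ == "__main__":
    print(f"Species with GenBank records: {share_with_records:.1f}%")
    for name, share in share_per_category.items():
        print(f"{name}: {share:.1f}%")
```

About 61.6% of the checklist species have at least one record. The six category counts sum to 2,055, which exceeds 1,575, so the categories evidently overlap and the per-category shares should not be expected to add up to 100%.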

Conclusions: Datataxa was useful for mining metadata sequence information from GenBank and can be used with any list of species to get the GenBank accessions’ metadata.
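To give a sense of what querying GenBank for an arbitrary species list can look like, the sketch below builds one NCBI E-utilities ESearch URL per species name. It is not Datataxa's implementation: the abstract only mentions Entrez and an API among its keywords. The helper name `build_esearch_url`, the choice of the `nuccore` database, the `retmax` value, and the example species are assumptions. No network request is made.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(species_name, retmax=20):
    """Build an ESearch URL listing nucleotide record IDs for one species.

    Hypothetical helper: the database and retmax value are illustrative.
    """
    params = {
        "db": "nuccore",                       # nucleotide database
        "term": f"{species_name}[Organism]",   # restrict the search to the taxon
        "retmax": retmax,                      # maximum number of IDs returned
        "retmode": "json",
    }
    return f"{ESEARCH_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # Arbitrary example names, not taken from the FBRA checklist.
    for name in ("Quercus rugosa", "Pinus cembroides"):
        print(build_esearch_url(name))
```

The IDs returned by such a search would still need a second request (for example ESummary or EFetch) to retrieve the publication titles and other metadata that the abstract describes.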



Author Biography

Carlos Alonso Maya-Lastra, Columbia University

Department of Ecology, Evolution and Environmental Biology

Postdoctoral Researcher


How to Cite
Ruiz-Sanchez, E., Maya-Lastra, C. A., Steinmann, V. W., Zamudio, S., Carranza, E., Murillo, R. M., & Rzedowski, J. (2019). Datataxa: a new script to extract metadata sequence information from GenBank, the Flora of Bajío as a case study. Botanical Sciences, 97(4), 754-760.