Datataxa: a new script to extract metadata sequence information from GenBank, the Flora of Bajío as a case study

  • Eduardo Ruiz-Sanchez Departamento de Botánica y Zoología, Centro Universitario de Ciencias Biológicas y Agropecuarias, Universidad de Guadalajara. Zapopan, Jalisco
  • Carlos Alonso Maya-Lastra Columbia University
  • Victor W. Steinmann Facultad de Ciencias Naturales, Universidad Autónoma de Querétaro, Campus Juriquilla, Juriquilla, Querétaro
  • Sergio Zamudio Investigador Independiente
  • Eleazar Carranza Instituto de Investigación de Zonas Desérticas, Universidad Autónoma de San Luis Potosí, San Luis Potosí
  • Rosa María Murillo Investigador Independiente
  • Jerzy Rzedowski Centro Regional del Bajío, Instituto de Ecología, A.C., Pátzcuaro, Michoacán
Keywords: API, checklist, Entrez, floristic treatment, Flora del Bajío y de Regiones Adyacentes, GenBank, vascular plants


Background: GenBank is a public repository that houses millions of nucleotide sequences. Several software tools have been developed to extract information stored in GenBank. However, none of them can extract and organize GenBank accessions based on their metadata. We developed a new script called Datataxa to mine GenBank information. The checklist of the Flora del Bajío y de Regiones Adyacentes (FBRA) was used as a case study to apply our script.

Questions: How many species occurring in the FBRA have records in GenBank? What percentage of those records have been used for phylogenetic, phylogeographic, phylogenomic, barcoding, genetic diversity, and biogeographic studies?

Methods: Datataxa was written in the AutoIt scripting language to facilitate the extraction of information from GenBank. This information was classified into six study categories. A checklist of the species treated in the published fascicles of the FBRA was used as a case study to apply our new script, and the six categories were applied to the FBRA species list.
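
The classification step can be illustrated with a minimal Python sketch. It is not the Datataxa implementation, which is written in AutoIt; the keyword patterns, function names, and example titles below are assumptions made for illustration only. It assigns publication titles to the six study categories and counts each species once per category.

```python
import re
from collections import Counter

# Hypothetical keyword patterns for the six study categories named above.
# The actual rules used by Datataxa are not given here.
CATEGORY_PATTERNS = {
    "phylogenetic": r"phylogen(y|ies|etic)",
    "phylogeographic": r"phylogeograph",
    "phylogenomic": r"phylogenomic",
    "barcoding": r"barcod",
    "genetic diversity": r"genetic (diversity|structure|variation)",
    "biogeographic": r"biogeograph",
}


def classify_title(title):
    """Return every category whose pattern matches the title (case-insensitive)."""
    return [
        category
        for category, pattern in CATEGORY_PATTERNS.items()
        if re.search(pattern, title, flags=re.IGNORECASE)
    ]


def count_species_per_category(titles_by_species):
    """Count species per category; a species counts once per category."""
    counts = Counter()
    for titles in titles_by_species.values():
        categories = set()
        for title in titles:
            categories.update(classify_title(title))
        counts.update(categories)
    return counts


if __name__ == "__main__":
    # Made-up titles and placeholder species names, for illustration only.
    example = {
        "Genus alpha": [
            "Molecular phylogeny and biogeography of a plant genus",
            "A phylogenetic study based on plastid markers",
        ],
        "Genus beta": ["DNA barcoding of a regional flora"],
    }
    for category, n in sorted(count_species_per_category(example).items()):
        print(category, n)
```

A title may fall into more than one category, which is why the per-category species counts need not sum to the number of species with GenBank records.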

Results: The script allowed us to search for metadata, such as publication titles, for the 2,558 species included in the FBRA. Of these, 1,575 had at least one record in GenBank. A total of 1,322 species were used in phylogenetic studies, followed by barcoding studies (326) and biogeographic studies (298). Phylogenomic (41), phylogeographic (34), and genetic diversity studies (34) were the least represented.

Conclusions: Datataxa proved useful for mining sequence metadata from GenBank and can be applied to any list of species to retrieve the metadata of their GenBank accessions.
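
As a rough illustration of applying such a search to an arbitrary species list, the Python sketch below builds one NCBI Entrez E-utilities `esearch` request URL per species. Which endpoints and parameters Datataxa itself uses is not stated here, so the choice of the `nuccore` database, the `[Organism]` field tag, the function name, and the example species are all assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(species, retmax=20):
    """Build an esearch URL for nucleotide records of one species.

    Assumption: the species name is matched against the Organism field
    of the nuccore (GenBank nucleotide) database.
    """
    params = {
        "db": "nuccore",
        "term": f"{species}[Organism]",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Arbitrary example names, not taken from the FBRA checklist.
    for name in ["Salvia mexicana", "Quercus rugosa"]:
        print(build_esearch_url(name))
```

The sketch only constructs the URLs; fetching them, respecting NCBI rate limits, and parsing the returned accession identifiers are left out.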



Author Biography

Carlos Alonso Maya-Lastra, Columbia University

Department of Ecology, Evolution and Environmental Biology

Postdoctoral Researcher

How to Cite
Ruiz-Sanchez, E., Maya-Lastra, C. A., Steinmann, V. W., Zamudio, S., Carranza, E., Murillo, R. M., & Rzedowski, J. (2019). Datataxa: a new script to extract metadata sequence information from GenBank, the Flora of Bajío as a case study. Botanical Sciences, 97(4), 754-760.