Advancements in DNA sequencing and computational technologies have influenced almost all areas of the biological sciences. DNA barcoding, a technology for generating nucleotide sequences (DNA barcodes) from standard gene region(s), can resolve the complexities that arise from morphological characters. DNA barcodes thus complement taxonomy, population analysis, and phylogenetic and evolutionary studies. They are also used for species identification from eggs, larvae, and commercial products. Sequence similarity search using the Basic Local Alignment Search Tool (BLAST) is the most reliable and widely used strategy for characterizing newly generated sequences. Similarity searches identify “homologous” gene sequence(s) for query sequence(s) by statistical calculations and provide identity scores. However, DNA barcoding relies on diverse DNA regions that differ considerably among taxa. Even region-specific variations within barcode sequences from a single gene, leading to “nonhomology,” have been reported. This complicates specimen identification, population analysis, phylogeny, evolution, and allied studies. Hence, selecting appropriate barcode region(s) homologous to the organism of interest is essential. Such complications could be avoided by using standardized barcode regions sequenced with optimized primers. This chapter discusses the potential problems caused by the unknowing, unintentional, or intentional use of nonhomologous barcode regions and the need for primer optimization.
- DNA barcodes
- homologous gene sequences
- standardized barcode regions
- primer optimization
Deoxyribonucleic acid (DNA) is considered the prime genetic material of the living world, as it stores the complete set of information dictating the structure of every gene product. The order of nucleotide bases (viz. adenine, guanine, cytosine, and thymine) along DNA contains these instructions for genetic inheritance. “DNA sequencing” refers to a technique for understanding the language of DNA by determining the order of nucleotide bases present within the genome of the organism(s) of interest [2, 3]. During the 1970s, researchers used two-dimensional chromatography to obtain the first DNA sequences in laboratories. Later, dye-based sequencing methods with automated analysis were developed for easier and faster DNA sequencing. With continued improvement in sequencing approaches, DNA sequence data derived from genes and genomes of organisms have become indispensable in basic research and allied fields.
Advancements in DNA sequencing and computational technologies have influenced almost all areas of the biological sciences. Taxonomy and systematics, the science of identifying organisms to the “species” level and then classifying them based on their relationships, are also well complemented by DNA sequence databases. Traditionally, “species,” the basic unit of taxonomy, is distinguished on the basis of certain unified external characters, termed “morphological characters,” observed in a sufficient number of specimens. Later, morphological type specimens were complemented with molecular data from molecular markers (allozymes, nuclear DNA, mitochondrial DNA), specifically in morphologically problematic groups [5, 6]. As molecular markers, gene type sequences (referred to as DNA barcodes) are developed using a technology called “DNA barcoding.” Thus, fundamental information from conventional taxonomy is complemented with genetic information from molecular taxonomy for scientific inferences. For more than a decade, this technique has received prime consideration in molecular research because of its capability to distinguish closely related species. It is also applicable to a broad spectrum of taxa for extensive biodiversity assessment studies. DNA barcoding remains a standard method for specimen identification and allied studies, and DNA barcodes serve as an indispensable tool for understanding the genetic relationships of organisms [7, 8, 9, 10, 11].
An ideal DNA barcode should possess certain qualities such as high universality and resolution. Since DNA barcoding relies on different DNA regions that vary between organisms (like bacteria, plants, animals, birds, etc.) [12, 13, 14], the choice of barcode region depends on the sample type. DNA barcode sequences are normally compared with a DNA reference library of morphologically pre-identified vouchers to assess the degree of similarity/dissimilarity, followed by assignment of taxonomic names to unknown specimens according to the percentage of identity [15, 16]. Since homology relations reflect the origin and relationships of taxa, examining homology relations through molecular characters is more direct than through morphology, owing to the discrete and “simple” nature of molecular characters. Thus, comparative sequence analyses are well suited to analyzing the biological relationships of DNA sequences. Two major disciplines that work at the interspecific and intraspecific levels are molecular phylogenetics and population genetics. Molecular phylogenetics typically deals with evolutionary relationships of different species, while population genetics is applied to characterize variations within and among populations of a single species [17, 18]. In short, DNA barcodes from standard gene region(s) complement taxonomy, population analysis, and phylogenetic and evolutionary studies at the genetic level. They are also used for identification of species, particularly from eggs, larvae, and commercial products [19, 20].
Among DNA barcodes, mitochondrial genes have gained preference because of their higher stability, mutation rate, and copy number per cell and their absence of introns, which provide more genetic information. Among mitochondrial genes, cytochrome c oxidase I (COI) is considered the primary barcode sequence for the animal kingdom [10, 15]. Mitochondrial DNA (mtDNA) has been used for phylogenetic studies in a large number of animals, including crustaceans, in a short span of time. Numerous reports supporting the broad benefits of DNA barcoding are available [15, 16, 22, 23, 24, 25]. Even though DNA barcoding has completed a decade as one of the versatile techniques for addressing numerous concerns in the life sciences, several authors [26, 27, 28] have also pointed out many drawbacks of this technique. A recent study reported issues regarding the use of nonhomologous barcode sequences in molecular studies. This chapter discusses the nonhomologous barcode regions of the COI gene available in public databases (like NCBI) and the issues arising from their unknowing/intentional/unintentional use in molecular analyses. Molecular results inferred from mitochondrial COI gene sequences (amplified using “Folmer” and “Palumbi” primers) of
2. An overview of mitochondrial cytochrome oxidase subunit 1 (mtCOI) gene of
DNA was first detected in mitochondria in 1963. It was found in association with proteins and lipids, localized to the mitochondrial matrix. Almost all eukaryotic cells possess a mitochondrial genome, whose genetic information has been used in systematics and population genetics for the past two decades. The complete mitochondrial DNA (mtDNA) sequence, approximately 17,000 base pairs (bp) long, has been determined in many species, including humans. Maternal inheritance, a relatively rapid mutation rate, and lack of intermolecular recombination are the major features behind its extensive use in population structure and phylogenetic studies at different taxonomic levels. To date, more than 1100 complete mitochondrial genome sequences or similar derivatives have been published. However, crustaceans, one of the most morphologically diverse animal life forms, are represented by only a limited number of complete mitochondrial sequences. Within crustaceans, decapods represent an extremely diverse group with many commercially important taxa including prawns, shrimps, lobsters, and crabs [7, 30]. Two major COI barcode regions are amplified for them using two sets of primers, namely Folmer (aka 5′ COI; LCO-HCO) and Palumbi (aka 3′ COI; Jerry-Pat), which are nonhomologous with limited overlaps [7, 35]. These two regions are widely used in decapod molecular taxonomy and associated research. In public databases (e.g., NCBI), several decapod species possess COI sequences derived from both these regions. Among them, the giant freshwater prawn
3. Impact of nonhomologous barcode regions in molecular taxonomy and allied studies
3.1 Specimen identification
DNA-based taxon identification for recognition of known species and discovery of new species is reported in many studies [7, 15, 23]. The mitochondrial cytochrome oxidase I (COI) gene is recommended as an efficient DNA barcode for identifying all kinds of animals [15, 16, 23], including cryptic species [18, 24]. Pairwise comparison of COI sequences of congeneric species generates a divergence of >2%, reaching up to 3.6% in species complexes and exceeding 5% in rare cases [15, 16, 23, 24]. The region at the 5′ end of COI (the “Folmer” portion) is considered the “DNA barcode” sequence, although it might be no better than the 3′ end of COI, i.e., the Palumbi sequence [7, 35, 36]. Even though these two regions are considered related fragments, even within crustaceans [36, 37], the region-specific conservation of “Folmer” and “Palumbi” sequences creates nonhomology. This creates diversity within the same gene, causing misinterpretations if the two regions are mixed unknowingly.
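The pairwise divergence mentioned above can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to the same barcode region and uses the uncorrected p-distance (proportion of differing sites), which is a simpler stand-in for the distance models commonly used in barcoding studies and is not necessarily the measure used in the cited work. The function names and the toy sequences are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two ALIGNED sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    Assumption: both sequences come from the same (homologous) barcode region.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def exceeds_threshold(distance: float, threshold: float = 0.02) -> bool:
    """Flag divergence above the ~2% value quoted in the text for congeners."""
    return distance > threshold


# Toy 20-bp fragments (invented for illustration, not real COI data).
toy_a = "ATGGCACTTAGCCTTCTAAT"
toy_b = "ATGGCACTCAGCCTTCTAAT"  # one substitution at the ninth site
divergence = p_distance(toy_a, toy_b)
print(divergence, exceeds_threshold(divergence))
```

Such a distance is only meaningful when both sequences cover the same region; comparing a “Folmer” fragment with a “Palumbi” fragment in this way would measure the difference between regions rather than between specimens.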
Here, the results inferred from nonhomologous COI gene regions of
Results inferred from the NJ tree exhibited reciprocal monophyly in its array, differentiating “Folmer” and “Palumbi” regions as two different entities. Out-group species that was expected to have higher divergence than the rest showed affinity toward the “Palumbi” sequences of
Inferences from phylogenetic tree will also be reflected in genetic distance data since the substitution accounted for calculating intraspecific divergence within
3.2 Population analysis
COI gene sequences are well considered for population analysis of many species including decapods. Here, the impact of nonhomologous regions in population studies is discussed using “Folmer” and “Palumbi” sequences of the genus
The tree topology was expected to reveal only two highly diverged populations of
3.3 Molecular phylogeny
The influence of nonhomology in phylogenetic studies was examined using “Folmer” and “Palumbi” sequences of
Tree topology exhibited cladistic array of selected organisms in accordance with the previous findings of specimen identification and population analysis, i.e., with respect to the region-specific conservation persisting within “nonhomologous” barcode regions of COI gene. Monophyly of
Since the first discovery of mitochondrial DNA in 1963, more than 5300 complete mtDNA sequences of different taxa have been submitted to NCBI to date [29, 31, 32]. These sequences are widely used to address different areas of molecular taxonomy. Among the preferred gene regions of mitochondrial DNA, cytochrome c oxidase subunit 1 (COI) remains one of the most recommended molecular markers because sequence data can be generated from it within a reasonable time and in a cost-effective way. These data can be used for sorting collections into identified species, biodiversity assessments, delineation of cryptic species, detection of population structure, gene flow pattern identification, phylogeographic studies, molecular phylogeny, evolution, etc. [7, 17, 39, 40, 41]. Altogether, this protein-coding mitochondrial gene has gained wide acceptance in large-scale projects on diverse taxa [42, 43, 44].
Usually, an ideal COI barcode region is reported to comprise about 648–700 nucleotides, which are used for similarity searches in nucleotide databases to identify known/unknown samples. The Barcode of Life Data System (BOLD) is a freely available database that acquires, analyzes, and releases DNA barcode data. Researchers interested in DNA barcoding and allied studies can submit sequence(s) to the public databases (NCBI/DDBJ/EMBL) or to the Consortium for the Barcode of Life website. Similarity search using nucleotide BLAST (BLASTN) or BOLD search (www.barcodinglife.org) is usually used to identify the status of a DNA sequence of interest. The search returns homologous nucleotide sequence(s) that have been sequenced previously for the same taxon or, failing that, homologous sequences of its close relative(s). However, some hidden factors can confuse a researcher trying to identify a species from the available database, i.e., nonhomologous sequences with region-specific conservation (e.g., in the COI gene), which may alter results to a great extent. Hence, these results must be scrutinized carefully, since there may be nonhomologous sequences for a taxon of interest that do not appear in the BLAST search because of their nonhomology. Altogether, similarity searches against the available database return top species matches, listing either the species whose reference sequence is accessioned in the database or, in the absence of a reference sequence for that particular species, the most closely related taxa.
This chapter has clearly discussed the presence of “Folmer” and “Palumbi” regions of
However, the presence of two barcode regions within a single target gene (COI), with the “Folmer” region as the first and the “Palumbi” region as the second barcode region for a broad class of organisms (particularly crustaceans), could cause severe problems in molecular studies. It would help if full COI sequences of taxa were included as a scaffold containing the two fragments prior to molecular analysis. But many organisms still lack whole mitochondrial genome sequence data and instead have only “Folmer” or “Palumbi” sequence(s). In such a scenario, if the “Palumbi” region of a specimen is sequenced and the public database holds only the “Folmer” region for the same organism, a BLAST search will fail to identify that sequence. It will indicate only the closest organism on the basis of sequence homology. The same applies to a “Folmer” query when the database holds only “Palumbi” sequences. This mainly affects specimen identification, as it could be hard for a researcher to identify the sample, particularly for specimens lacking major morphological characters. Even for specimens with slight morphological variations from their type descriptions, an existing species could fail to be identified and be misinterpreted as a novel species. In population analysis, there is a chance that “nonhomologous” fragments will be misinterpreted as a different population.
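The scenario described above can be reproduced with a toy simulation. The following is a minimal sketch, not a BLAST implementation: it ranks database entries by the number of shared k-mers, which is a deliberately crude stand-in for a similarity search. All sequences are randomly generated, the species labels are placeholders, and the region coordinates are arbitrary toy values with no overlap (real Folmer and Palumbi fragments are reported to have limited overlaps).

```python
import random

BASES = "ACGT"


def random_sequence(length, rng):
    """Generate a random nucleotide string (toy stand-in for a full COI gene)."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq, rate, rng):
    """Substitute each base with a different base at the given probability."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def kmers(seq, k=11):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_hits(query, database, k=11):
    """Rank database entries by shared k-mers with the query (highest first)."""
    query_kmers = kmers(query, k)
    scores = {name: len(query_kmers & kmers(seq, k)) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


rng = random.Random(42)
gene_a = random_sequence(1500, rng)   # hypothetical full gene of species A
gene_b = mutate(gene_a, 0.10, rng)    # hypothetical relative, ~10% diverged

FOLMER = slice(0, 650)      # toy 5' region (assumed coordinates)
PALUMBI = slice(700, 1500)  # toy 3' region (assumed coordinates)

# The database holds only the Folmer fragment of A and the Palumbi fragment of B.
database = {
    "species_A|Folmer": gene_a[FOLMER],
    "species_B|Palumbi": gene_b[PALUMBI],
}

# A new specimen of species A is sequenced with the Palumbi primers.
query = gene_a[PALUMBI]
hits = rank_hits(query, database)
for name, score in hits:
    print(name, score)
```

In this toy setting the query from species A ranks the related species B first, because only B is represented by the same region, while the conspecific Folmer entry shares essentially nothing with the query.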
Molecular phylogeny could also suffer severe errors because of dual barcode regions. Even if both barcode regions are present in the nucleotide database, a BLAST search will list only the sequences homologous to the query sequence(s). In such cases, the chance of error is reduced, although the full dataset is not explored. However, part of the dataset supplied to the database for a taxon may be missed because of nonhomology. In addition, even among congeneric species, monophyly could not be established because of the impact of “nonhomologs” (refer to Figure 4b and c). Tree topology could be altered by dual barcode regions. As a result, the inferred relationships between morphologically similar species and species groups could be altered.
5. Conclusions and recommendations
GenBank holds an enormous amount of molecular data, within which more than 90% of mtDNAs belong to metazoans and the remaining sequences represent fungi and terrestrial plants. About 3% of the available mitochondrial genomes represent protists. Apart from the commonly discussed issues such as misidentifications and pseudogenes, the lack of primer pair data, particularly for certain unpublished datasets, remains a major drawback. Moreover, designing and using multiple primer pairs for various objectives of molecular taxonomy has generated multiple DNA fragments from the same gene. Hence, under a single species name, there could be multiple DNA sequences from a single gene that are “nonhomologous” in nature. In principle, nucleotide sequences should be selected on the basis of their homology, but in the present scenario it is hard to identify homologs by BLAST search. Most trace samples used in forensic studies remain undetected because of the lack of standardization in barcode regions. Even though the “Folmer” region is considered a universal barcode region, there are numerous reports of “Palumbi” sequences serving as better barcodes based on species specificity. This issue could be resolved with complete genome sequences, which, however, have been developed for only limited taxa. A better way to resolve the problem of “nonhomology” is to provide data on the primer pair(s) used along with the nucleotide sequence data. Researchers concerned about the privacy of their research could opt for the embargo period provided before nucleotide sequence data are released to the public database. It is also true that researchers develop diverse primers for amplifying specific genes.
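The recommendation to attach primer pair information to each sequence can be sketched as a small data structure. This is an illustrative assumption about how such metadata might be organized locally before analysis, not a description of any database's submission format; the record fields, accession strings, and species labels are hypothetical, while the primer pair names follow those used in this chapter.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class BarcodeRecord:
    """One barcode sequence with the primer pair that produced it (hypothetical schema)."""
    accession: str              # placeholder identifier, not a real accession
    species: str
    gene: str
    primer_pair: Optional[str]  # e.g. "LCO-HCO" (Folmer) or "Jerry-Pat" (Palumbi)
    sequence: str


def group_by_region(records: List[BarcodeRecord]) -> Dict[Tuple[str, str], List[BarcodeRecord]]:
    """Partition records by (gene, primer pair) so that only fragments amplified
    with the same primers are analyzed together. Records without primer data
    are kept apart under the label "unknown"."""
    groups = defaultdict(list)
    for record in records:
        groups[(record.gene, record.primer_pair or "unknown")].append(record)
    return dict(groups)


records = [
    BarcodeRecord("EXAMPLE001", "Species A", "COI", "LCO-HCO", "ACGTACGT"),
    BarcodeRecord("EXAMPLE002", "Species A", "COI", "Jerry-Pat", "TTGACCGA"),
    BarcodeRecord("EXAMPLE003", "Species B", "COI", "LCO-HCO", "ACGTACCT"),
    BarcodeRecord("EXAMPLE004", "Species B", "COI", None, "GGCATTAC"),
]

for (gene, primers), members in group_by_region(records).items():
    print(gene, primers, [m.accession for m in members])
```

Grouping on the primer pair before alignment or tree building keeps sequences from the two regions out of the same analysis, and records without primer information are isolated rather than silently mixed in.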
One more recommendation is to update the nucleotide submission data after publication of the corresponding manuscript, so that the entire research community can obtain proper information on the background of the nucleotide sequence without compromising anyone’s privacy. DNA barcoding has crossed the boundaries of academia and is now used in food authentication, medical applications, forensic science, etc. Since DNA-based analysis has become an important part of these fields, region-specific issues related to gene sequences need to be addressed and resolved. This chapter has addressed the present nature of barcode regions derived from COI and the issues related to them, so that primer optimization can be practiced at the earliest. Regardless of a single species (
The authors gratefully acknowledge the Kerala State Council for Science, Technology and Environment (KSCSTE) for providing financial support. We also thank the director, School of Industrial Fisheries, Cochin University of Science and Technology, for providing the necessary facilities and support for conducting this research.
Conflict of interest
The authors declare responsibility for the entire contents of this chapter and have no conflicts of interest.