A decade ago Sites and Marshall described the empirical practice of species delimitation as “a Renaissance issue in systematic biology”. At the time there was an odd disconnect between the two frequently stated empirical goals of systematic biology and the actual practice of the discipline. Those goals are the discovery of (1) monophyletic groups (clades) and the relationships within these at all hierarchical levels above species, and (2) lineages (species). While much of systematic biology had been devoted to the first goal, the second had until recently been largely ignored, despite the fact that species are routinely used as the basic units of analysis in biogeography, ecology, evolutionary biology, and conservation biology [3,4]. However, Sites and Marshall noted “signs of a Renaissance” at the time of their review, which was precipitated in part by others emphasizing the need to distinguish between a non-operational, ontological definition of species and the empirical (operational) data needed to test their reality [5-7]. De Queiroz (p. 60) noted that “All modern species definitions either explicitly or implicitly equate species with segments of population level evolutionary lineages.” De Queiroz also noted that this was a revised version of Simpson’s “evolutionary species concept”, which defines a species as “a lineage (an ancestral-descendant sequence of populations) evolving separately from others and with its own evolutionary role and tendencies” (p. 153), and called this a General Lineage Concept (GLC) of species (p. 65). De Queiroz further emphasized that the multiple empirical criteria simply reflect the many contingent properties (differences in genetic or morphological features, adaptive zones or ecological niches, mate-recognition systems, reproductive compatibility, monophyly, etc.) of diverging populations associated with different evolutionary processes operating in various geographic contexts [10,11].
Sites and Marshall noted that the emerging consensus among systematists and evolutionary biologists was based on the utility of this distinction (ontological definition vs. empirical species delimitation [SDL] methods). As also noted by de Queiroz, because of the contingencies of speciation processes, any single criterion or data set will artificially reduce the complexity of evolving lineages.
The subject matter of these and other reviews [12,13] focused strictly on methods of detecting various lines of evidence for lineage independence (reproductive isolation, ecological distinctiveness, diagnosability, monophyly, etc.). Since then new methods have continued to be described, as have studies comparing the performance of some of these [14,15]. In 2006, the Society of Systematic Biologists (SSB’06) organized the first symposium dedicated to the topic of species delimitation; 11 papers were presented and six of those were published, including an update by de Queiroz, which emphasized the distinction between the GLC, which equates species with “separately evolving metapopulation lineages, or more specifically, with segments of such lineages”, and the secondary biological attributes or properties of organisms that can be quantified to empirically test for species status. This is a crucial distinction because it clearly separates the conceptual issue of defining the species category from the methodological issues of delimiting species. Previously these had been conflated, with the result that properties used to infer species boundaries (the empirical test) were also sometimes regarded as necessary for defining a species (a conceptualization issue). The advantage of the unified GLC is that no specific biological attributes of a species are considered necessary properties: species may exist as segments of metapopulation lineages regardless of our ability to empirically delimit them. Prior to this clarification, and the realization that many different properties are relevant to the issue of species delimitation, the alternative species “concepts”, each built around a biological attribute that accumulates in diverging lineages, treated that attribute as a necessary property of species.
This led to a confusing situation in which a different property was considered necessary under each alternative concept (22 such “concepts” were identified by Mayden), and a long and ultimately non-productive debate about species definitions. Now most of these earlier “concepts” can be viewed as secondary species criteria that provide evidence of lineage separation.
Recently, Hausdorf argued for an updated ontological species concept, based on new insights into speciation processes, particularly evidence that reproductive barriers are semi-permeable to some gene flow, and that speciation may occur despite ongoing gene flow between diverging populations [18-23]. Two other lines of evidence are relevant to the point of revisiting the GLC: (1) findings of polyphyletic species of animals, due to parallel speciation in which similar traits conferring reproductive isolation arise separately in closely related populations [24,25], or in plants, due to recurrent polyploidization in different populations of the ancestral species [26,27]; and (2) discoveries of uniparental organisms that can be characterized as distinct units resembling species of biparental organisms. We cannot resolve all of these larger issues here, but we return to some of the general points raised by Hausdorf in the discussion.
Empirically, species delimitation continues to be a topic of increasing interest in evolutionary biology. A reference search in the ISI Web of Science with the keyword ‘species delimitation’ retrieved 227 articles published since 2000, of which 60% were published after 2008. Fewer than 10 articles per year were published between 2000 and 2005, 10-20 articles per year between 2006 and 2008, and after 2008 the publication rate reached ~40 articles per year (Figure 1A). These increases include papers describing new SDL methods, or using existing methods with novel data sets and/or applications to new taxa. Because new SDL methods apply the same coalescent models developed for species tree estimation and usually lead to the discovery of morphologically ‘cryptic’ species, we also searched for references with the keywords ‘species tree’ and ‘cryptic species’. During the same period, papers about ‘species trees’ were few until 2007, increased between 2008 and 2010 to 5-10 articles per year, and nearly doubled to >20 papers last year (Figure 1B). Publications referring to ‘cryptic species’ show a constant increase from 20 papers/year in 2000 to 90 papers/year in 2011, with the largest annual increase between 2010 and 2011 (Figure 1C). These publication trends suggest that the recent paradigm shift in phylogenetic systematics to incorporate species trees [29] is having a positive impact on the development of new SDL methods, which are gradually being incorporated into integrative taxonomic practices for the discovery of cryptic species diversity.
2.1. Short history of some early methods
Sites and Marshall [1,13] separated SDL methods into non-tree and tree-based approaches, and included among the former (1) pairwise genetic distances that could be tested for correlations with either reproductive isolation [31,32], morphological distances, or geographic distances; (2) gene flow statistics to estimate the extent of gene flow across hybrid zones; (3) fixed alternative character states as an indicator of no gene flow in a “population aggregation analysis” (PAA); (4) the presence of heterozygous genotypes as an indicator of a “field for recombination”; and (5) genotypic clusters.
Early tree-based methods included: (1) three versions of the phylogenetic species “concept” based on apomorphy, lineage splitting, or node-based criteria, following the terminology of Brooks and McLennan; (2) cladistic haplotype aggregation; (3) molecular-morphological assessments using dichotomous flow charts; (4) genealogical exclusivity; and (5) an extension of the nested clade analysis that includes tests of species boundaries. The data sets in these early studies most often included genotypes resolved from multilocus isozymes, morphological (usually meristic) characters, and, with few exceptions [45,46], mitochondrial DNA (mtDNA) sequences. An innovative phylogenetic method described by Pons
The published contributions of the SSB’06 symposium included several novel SDL methods. The first described a coalescent approach to estimating species boundaries based on multiple unlinked gene trees, which does not require species to be characterized by reciprocal monophyly. This is an explicitly model-based approach that accommodates stochastic variance of the gene sorting process by linking estimates of two key parameters: a range of estimates of effective population sizes relative to possible divergence times. This type of gene tree-coalescence approach also directly links population genetic SDL methods to phylogenetic inference at deeper levels of divergence, which has been identified as a “new paradigm” in systematics. In this same issue, Shaffer and Thomson introduced a population genetic SDL method based on large sets of single nucleotide polymorphisms (SNPs), which would be best suited to delimiting very young species. Finally, this volume included two more novel SDL methods, both using ecological and distributional data to model “niche envelopes” that can augment molecular or morphological data in species delimitation [49-51].
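As a rough illustration of why effective population size and divergence time have to be considered together, two standard coalescent results for a constant-size, panmictic diploid population can be written down. The symbols $N_e$ (effective size), $t$ (time in generations), and $T$ (time in coalescent units) are introduced here for illustration only, one sequence per species is assumed, and none of this is taken from the specific methods cited above.

```latex
% Probability that two lineages in the same population have NOT yet
% coalesced after t generations (continuous-time approximation):
\[
  P(\text{no coalescence by } t) \approx e^{-t/(2N_e)}
\]

% For a rooted three-species tree whose internal branch has length
% T = t/(2N_e) coalescent units, the probability that a gene tree
% (one sequence per species) has the same topology as the species tree:
\[
  P(\text{gene tree} = \text{species tree}) = 1 - \tfrac{2}{3}\, e^{-T}
\]

% Worked values: T = 0.25 gives about 0.48, and T = 1 gives about 0.75.
```

Under these assumptions a short internal branch relative to $N_e$ leaves most gene trees discordant with the species tree, which is why approaches that do not require reciprocal monophyly are attractive for recently diverged lineages.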
2.2. Recent progress
2.2.1. New methods & new theory
New empirical SDL methods continue to be developed, based on multiple lines of evidence and multiple statistical methods. Among these is the approach of Bond and Stockman, which is especially relevant to highly geographically structured populations in which traditional sequence-only data sets are likely to recover large numbers of well-defined, well-supported, geographically concordant, and genetically divergent but morphologically cryptic populations (species). These authors describe a framework for testing potential genetic and ecological exchangeability as a means of delimiting cohesion species, and present an example in trapdoor spiders of the
The recent merger of coalescent theory with phylogenetics has driven a new generation of SDL methods and a new paradigm in systematics. This new theoretical framework, and its derived analytical applications, was required in part as a solution for accommodating the observed conflict between genealogies from multiple loci (gene trees) and the underlying population-level genealogies (species trees). A multi-species or ‘censored’ model was formulated to account for this discordance by considering each branch of the species tree as a separate coalescent model and by connecting them into a population-level genealogy following the topology of the species tree [62,63]. This approach achieved two key innovations over classic phylogenetic methods. First, multiple individual samples can be assigned to a single species, and the estimated phylogeny represents the speciation history of ancestral and descendant species-level lineages, in contrast to the gene genealogies estimated with individual samples. Second, because the coalescent process of each gene tree depends upon parameters of its containing species tree, this approach can co-estimate gene and species trees simultaneously, bypassing the task of calculating a consensus tree or estimating a phylogeny from a concatenated data set. This theoretical framework allows prediction of the probability distribution of gene trees given the species tree, and consequently several methods were developed for estimating species trees from a collection of multiple gene trees under different algorithms [64,65]. Based on these new methods, a generation of fully coalescent SDL methods was introduced that consists of selecting the best species-tree model from a set of alternative models that represent different hypotheses of species limits.
For instance, one approach finds the maximum likelihood for the full species tree (all species are hypothesized as separate lineages) and for alternative species trees (two or more species at a given node are collapsed into one), and then selects the best model using the Akaike information criterion, assuming fixed gene trees and constant population sizes along the species tree (SpeDeSTEM).
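The model-selection step described above can be sketched in a few lines. This is a minimal illustration of AIC-based comparison in general, not the SpeDeSTEM implementation: the function name `akaike_weights`, the log-likelihood values, and the parameter counts are all hypothetical and chosen only to show the arithmetic.

```python
import math


def akaike_weights(models):
    """Return {name: (AIC, delta_AIC, Akaike weight)} for each candidate model.

    `models` maps a model name to (log_likelihood, number_of_parameters).
    AIC = 2k - 2 lnL; the weight is the relative likelihood of each model
    normalised over the candidate set.
    """
    aic = {name: 2 * k - 2 * lnl for name, (lnl, k) in models.items()}
    best_aic = min(aic.values())
    rel = {name: math.exp(-0.5 * (value - best_aic)) for name, value in aic.items()}
    total = sum(rel.values())
    return {name: (aic[name], aic[name] - best_aic, rel[name] / total) for name in aic}


# Hypothetical inputs: log-likelihoods and parameter counts are invented
# for illustration and do not come from any real analysis.
candidate_models = {
    "A and B separate": (-1000.0, 2),
    "A and B collapsed": (-1010.0, 1),
}

table = akaike_weights(candidate_models)
best_model = min(table, key=lambda name: table[name][0])

for name, (aic_value, delta, weight) in table.items():
    print(f"{name}: AIC={aic_value:.1f} dAIC={delta:.1f} weight={weight:.4f}")
print("Preferred delimitation model:", best_model)
```

With these invented numbers the two-lineage model has the lower AIC, so it would be preferred; in a real analysis the log-likelihoods would come from species-tree estimation under each delimitation hypothesis.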
Another SDL method consists of sampling from the Bayesian posterior distribution of species delimitation models using reversible-jump Markov chain Monte Carlo (rjMCMC) with the program BP&P 2.1. This approach accommodates gene tree uncertainty and variable population sizes, but a “known” species tree must be provided
2.2.2. New kinds of data
The development of new multi-species/multi-locus SDL methods was also due in part to the demand for efficient analytical tools to handle the rapidly increasing amounts of molecular data collected with modern techniques. New SDL methods should be able to handle tens of loci for multiple individuals derived from the development and screening of anonymous nuclear loci (ANL), introns, and protein-coding loci using genomic resources [71-73]. However, these new SDL methods are inadequate for analyzing the influx of whole-genome data that have started to be collected for non-model organisms via next-generation sequencing (NGS) technologies ([74-76]; e.g., genome of the lizard
More efficient and less costly whole-genome sequencing is becoming available on a regular basis, a trend that started with the first-generation technology (Sanger capillary sequencing), followed by the second generation (i.e., SOLiD, 454, Illumina, Solexa, etc.), and continuing today with the recently introduced third-generation ‘nanopore’ sequencing [84,85]. A significant by-product of these single-molecule sequencing methods is their ability to automatically resolve the allelic phases of heterozygotes, in contrast to the time-consuming phase estimation and/or cloning required after direct dideoxy sequencing. In addition, the uniform sampling of hundreds of loci across the genome can help identify “outlier” loci via genome scans; these can represent candidate genes with fitness value that are subject to selection and linked to processes such as ecological speciation.
2.2.3. Advantages of Multi-Species Coalescent-Based Methods (MSCM)
2.2.4. Disadvantages of MSCM
Many of the advantages listed above also impose limitations on MSCM and other SDL methods, for different reasons. First, these are
Second, most MSCM assume that species have diverged from a common ancestral species without gene flow, even though speciation with gene flow seems to be rather common in nature, especially in cases of ecological speciation [22,95,105]. While these methods ignore the effects of gene flow, simulation testing has shown that some of them are relatively robust to low levels of gene flow [66,92], and that its impact on delimitation accuracy is ameliorated when gene flow is explicitly incorporated in the speciation model. This result supports the suggestion that, in order to distinguish between species- and population-level differentiation, it is necessary to jointly consider the two components of the divergence process: time since splitting and gene flow after divergence.
Third, coalescent-based SDL approaches assume
Fourth, as in other methods
Fifth, there may be conflicts with
3. Future directions
In order to provide a preliminary evaluation of the impact of sampling design on the performance of new SDL methods, we simulated coalescent genealogies with the program ms and sequence data with the program Seq-Gen for a speciation model between species A and B at three increasing divergence times: 0.25, 0.5, and 1Ne (Figure 2A). We assumed a constant θ per site = 0.01, 500 bp per locus, and ~50 variable sites per locus. For each divergence time, we simulated 5 combinations of number of loci (1, 2, 4, 10, and 20) and number of sequences per locus per species (20, 10, 5, 2, and 1, respectively), keeping the total sequencing effort constant (20 sequences per species). We simulated 100 replicates for each sampling treatment and analyzed them with BP&P to calculate the mean speciation probability between species A and B across replicates, which represents the accuracy of the method (i.e., the probability of detecting speciation when it is the true hypothesis). We also simulated a no-speciation model in which sequences from species A and B were collapsed into a single lineage, and repeated the same sampling and analytical procedure to examine the performance of the method based on a plot of true-positive and false-positive rates (i.e., a ROC plot).
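The sampling treatments described above can be enumerated with a short sketch, which makes the constant-effort constraint explicit. The variable and function names are hypothetical, the pairing of loci with sequences per species is inferred from the stated constant effort of 20 sequences per species, and the sketch does not invoke ms, Seq-Gen, or BP&P.

```python
# Values taken from the text: theta per site, locus length, divergence
# times (in the units used in the text), loci counts, and total effort.
THETA_PER_SITE = 0.01
LOCUS_LENGTH_BP = 500
DIVERGENCE_TIMES = [0.25, 0.5, 1.0]
LOCI_OPTIONS = [1, 2, 4, 10, 20]
TOTAL_SEQUENCES_PER_SPECIES = 20
REPLICATES = 100


def sampling_designs(loci_options, total_per_species):
    """Pair each number of loci with the sequences per locus per species
    that keeps loci * sequences equal to the fixed total effort."""
    designs = []
    for n_loci in loci_options:
        if total_per_species % n_loci != 0:
            raise ValueError(f"{n_loci} loci do not divide {total_per_species}")
        designs.append((n_loci, total_per_species // n_loci))
    return designs


designs = sampling_designs(LOCI_OPTIONS, TOTAL_SEQUENCES_PER_SPECIES)
theta_per_locus = THETA_PER_SITE * LOCUS_LENGTH_BP
n_treatments = len(DIVERGENCE_TIMES) * len(designs)
n_datasets_per_model = n_treatments * REPLICATES

for n_loci, n_seqs in designs:
    print(f"{n_loci:>2} loci x {n_seqs:>2} sequences per species")
print("theta per locus:", theta_per_locus)
print("simulated data sets per model:", n_datasets_per_model)
```

Listing the treatments this way shows that the design trades loci against sequences per species rather than varying them independently.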
The results show that, under the conditions examined, sampling more sequences per species is better than sampling more loci, at least in the range of 1-20 loci and sequences per species (Figure 2B). The ROC plots for the 5 sampling treatments at a divergence time of 0.5Ne show that performance is higher (i.e., the area under the ROC curve is larger) when sampling 20 sequences for 1 locus or 10 sequences for 2 loci, and that performance gradually decreases with more loci and fewer sequences (Figure 2C). These results are congruent with the impact of sampling design on the accuracy of species-tree methods (STM) at shallow divergence times [115,116], which is an expected outcome because both STM and SDL methods share the same basic multispecies coalescent model [67,117]. However, our results are contingent upon the conditions simulated, in particular the assumptions of panmixia within species and a constant θ across the species tree. This second assumption concerns θ, a critical parameter of coalescent models, which can be estimated more accurately with a larger sample of loci. Our aim with this simulation example was to show how the performance of an SDL method can be evaluated under a variety of sampling conditions based on a power analysis, and that this same approach can be applied to comparisons across different SDL methods and to more complex speciation scenarios than those that have been examined so far.
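A ROC analysis of this kind can be sketched as follows. This is a generic illustration, not the procedure actually used to produce Figure 2C: the function names are hypothetical, and the posterior speciation probabilities are invented stand-ins for values that would be obtained from replicates simulated under the speciation model (positives) and the no-speciation model (negatives).

```python
def roc_points(positive_scores, negative_scores):
    """Return (false-positive rate, true-positive rate) pairs obtained by
    sweeping a decision threshold from the highest score to the lowest."""
    thresholds = sorted(set(positive_scores) | set(negative_scores), reverse=True)
    points = [(0.0, 0.0)]
    for threshold in thresholds:
        tpr = sum(s >= threshold for s in positive_scores) / len(positive_scores)
        fpr = sum(s >= threshold for s in negative_scores) / len(negative_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical posterior speciation probabilities, for illustration only.
speciation_replicates = [0.99, 0.95, 0.80, 0.60]     # true model: two species
no_speciation_replicates = [0.70, 0.40, 0.20, 0.05]  # true model: one species

curve = roc_points(speciation_replicates, no_speciation_replicates)
auc = area_under_curve(curve)
print("ROC points:", curve)
print("Area under ROC curve:", auc)
```

An area of 0.5 corresponds to a method that cannot distinguish the two models, and an area of 1 to perfect discrimination, so comparing areas across sampling treatments gives a single-number summary of the kind of performance difference discussed above.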
A potential protocol for an informed species delimitation approach that takes population structure into account could consist of first applying a clustering/population aggregation method to identify the smallest clusters of individuals under a population genetics criterion based on genotype or allele frequency data (e.g., Structure 48, 58, 60). Subsequently, an SDL method can be applied to test whether these clusters also represent independent evolutionary lineages based on the pattern of allele coalescence in gene genealogies (e.g., BP&P). Because population divergence starts with differentiation in allele frequencies, and only later do random lineage sorting and mutation further differentiate lineages during speciation, population genetics approaches are expected to detect lineages earlier than SDL approaches. For example, an empirical analysis of West African forest geckos (
There is an ongoing genomics revolution in the study of adaptation in ecological and evolutionary non-model organisms, driven by NGS technologies [76,128]. Decreasing sequencing costs and new protocols for discovering and screening thousands of markers scattered throughout the genome are now allowing the application of population genomics approaches to identify the candidate loci underlying adaptive traits with ecological significance. In fact, recent studies have found genomic regions and/or specific loci related to repeated local adaptation, population divergence, and reproductive isolation between ecotypes in different habitats or hosts [129,130]. We anticipate that these ‘speciation genomics’ approaches will become more common in non-model organisms and will provide a basis for species delimitation in scenarios of adaptive speciation, complementing current SDL methods. Moreover, this plurality of criteria for species delimitation based on multiple kinds of traits is consistent with the GLC of species, which views these organismal traits as evolving in different temporal order depending on how speciation has actually taken place [9,12]. It is also compatible with the more recent ‘differential fitness’ concept, which is based on those organismal features of one species that have negative fitness effects in other species and cannot be exchanged upon contact.
AC acknowledges a postdoctoral fellowship from CONICET (Argentina). For financial support we thank NSF awards OISE 0530267 and AToL 0334966 to JWS, as well as BYU graduate research and graduate mentoring awards, and student research awards from the Society of Systematic Biologists and the Society for the Study of Amphibians and Reptiles, to AC. We both also received support from the BYU Dept. of Biology and the Bean Life Science Museum.