Open access peer-reviewed chapter

Dealing with the Data Deluge – New Strategies in Prokaryotic Genome Analysis

By Leonid Zaslavsky, Stacy Ciufo, Boris Fedorov, Boris Kiryutin, Igor Tolstoy and Tatiana Tatusova

Submitted: April 16th 2015Reviewed: December 8th 2015Published: January 14th 2016

DOI: 10.5772/62125

Downloaded: 1776


Recent technological innovations have ignited an explosion in microbial genome sequencing that has fundamentally changed our understanding of biology of microbes and profoundly impacted public health policy. This huge increase in DNA sequence data presents new challenges for the annotation, analysis, and visualization bioinformatics tools. New strategies have been designed to bring an order to this genome sequence shockwave and improve the usability of associated data. Genomes are organized in a hierarchical distance tree using single-copy ribosomal protein marker distances for distance calculation. Protein distance measures dissimilarity between markers of the same type and the subsequent genomic distance averages over the majority of marker-distances, ignoring the outliers. More than 30,000 genomes from public archives have been organized in a marker distance tree resulting in 6,438 species-level clades representing 7,597 taxonomic species. This computational infrastructure provides a foundation for prokaryotic gene and genome analysis, allowing easy access to pre-calculated genome groups at various distance levels. One of the most challenging problems in the current data deluge is the presentation of the relevant data at an appropriate resolution for each application, eliminating data redundancy but keeping biologically interesting variations.


  • Genome analysis
  • clusters
  • proteins
  • bacteria
  • prokaryotes

1. Introduction

Prokaryotes are probably the largest and the most diverse group of cellular organisms.

The number of described species is now about 12,000, and the number of species on earth is estimated in the millions [1]. Recent rapid advances in sequencing technologies provided a relatively cheap and fast way of studying the diversity of microbial species by discovering representatives of novel divisions or even phyla [2] and analyzing the variation within the species by sequencing closely related genomes from the ecological microbial populations or clinical studies of pathogenic bacteria.

Historically, prokaryotic organisms were organized by classical taxonomic ranking system (species, genus, family, order, and phylum). Delineation of prokaryotic species was originally based on phenotypic information, pathogenicity, and environmental observations. Due to the high mutation level, fast replication rate, and efficient DNA exchange mechanisms, microbial organisms can easily adapt to their habitats. Genomic studies have shown that different species living in similar ecological environments show similarity at the genomic level (e.g., congruent evolution of water-living bacteria from various taxonomic origins [3]) while same pathogenic species (or symbionts) rapidly adapting to the new hosts become quite different at genomic level (e.g., Buchnera aphidicola[4], Serratia symbiotica[5]).

Next-generation sequencing technologies provide new insights into the life of microbes and their interactions with the host, but they do not classify the organisms in a traditional way. Many novel species are described as “candidatus” or “<genus> sp.”

The genomes of uncharacterized isolates of the Candidatus Arthromitus, host-specific intestinal symbionts, comprise a distinct clade within the Clostridiaceae [6].

The number of uncharacterized species is rapidly growing in public genome collections. As of November 2014, almost half of bacterial and archaeal species in NCBI Refseq data set remain uncharacterized. (Bacteria: 3,559 uncharacterized, 7,597 total; Archaea: 162 uncharacterized, 399 total.)

The need for different approaches to the identification of microbial species that can take into account the advantages of the growing massive volume of genomic sequence data is being actively discussed in the research community.

Scientists from different disciplines (taxonomists, ecologists, and evolutionary biologists) have different interpretations of species defined by the framework of their needs and the tools they use for identification. A recent review [7] describes the history and present state of various methods of description of prokaryotic species. The authors suggest the concept of species as “a category that circumscribes monophyletic, and genomically and phenotypically coherent populations of individuals that can be clearly discriminated from other such entities by means of standardized parameters.”

Comparative analysis requires a target: a coherent group of isolates with some degree of similarity defined by the goal of the study (the analysis of pathogen outbreak performed at the species level or below, while biodiversity studies use broader group such as families or phyla). Several groups have attempted to delineate the taxonomy of Archaea and Bacteria using the methods based on single-copy universally conserved markers [8-13]. Other methods are discussed in recent reviews [14].

Different species vary dramatically in terms of the sampling density and data quality. Clinical and epidemiological studies produce large data sets of closely related (clonal) genomes (Table 1), while other species are sampled very coarsely. Genomic and proteomic structure of a densely sampled group of related strains is commonly described by the concept of pan-genome [15] species.

The complexity of the data is challenging to the analysis, representation, and visualization of the data sets. One of the challenges is the amount of the resources required for a brute-force processing approach (e.g., BLAST all-to-all of 35 million proteins will take five days on 1,000 processors). Another big problem is data heterogeneity and redundancy: the closest-neighbor results will often contain long list of nearly identical objects, making it difficult to identify more distant neighbors.

Here we describe a combined approach that provides a robust, fast, and scalable method of defining the sequence similarity genome groups that can be used for comparative genome analysis and resolve some known issues with the delineation of species in traditional taxonomy.


2. Materials and methods

The genomes are organized in hierarchical groups calculated with different methods. The universal ribosomal markers approach is used to build a distance tree and to define species and superspecies-level clades (genome groups). The species-level clades are further refined by using whole-genome alignments and creating tight (clonal) genome groups.

2.1. NCBI hardware and software

The hardware available at NCBI includes a Univa Grid Engine (UGE) Grid-Engine-based computer farm and PanFS scalable storage system connected through a powerful router. The most recent version UGE 8.2.1 includes support for Linux GROUPs, support for Window server execution nodes, and a beta version of DRMMA2 (Distributed Resource Management and Application API 2). A large weakly coupled distributed computer system like this requires coarse-grained parallelization approaches with minimal communication between the processes. Many processing steps, such as computing BLAST hits, are naturally parallel.

2.2. Data snapshot

A given data snapshot represents a collection of genome (and protein) sequences and metadata available at the time. Navigating through millions of nucleotide sequences in public archives to find a set that comprises a whole-genome collection can be sometimes challenging. GenBank release 207 contains 182,188,746 sequences, and 189,739,230,107 nucleotides. The traditional NCBI sequence repository was designed for GenBank records in the early 1990s. It is organized as a collection of single-nucleotide sequence records with annotated sequences stored as nucleotide–protein sets. By GenBank requirements, each sequence record should be associated with the organisms registered in the NCBI Taxonomy Database. For the first 10 years of microbial genome sequencing, each species has a unique genome representation in public sequence archives. When sequencing costs decreased, researchers began to explore microbial population structure and the intraspecies differences. NCBI Taxonomy group began assigning Taxonomy ID for strain level nodes as proxies of unique genome identifiers. More recently, next-generation sequencing and rapid pathogen detection approaches have shifted the paradigm from a single isolate representing an organism to multiisolate projects often representing almost identical isolates from the outbreak analysis. These closely related genomes differ by metadata only: patient information, date, and place of sample collection. NCBI has created new resources that capture the sequence data and metadata information: BioProject, BioSample, and Assembly [16]. A triplet of these identifiers uniquely defines a genome with the metadata that can be used for further comparative analysis.

NCBI internal database UniCol is used to store collections of the nucleotide and protein sequence data associated with every BioProject, BioSample, Assembly triplet. The database provides a tracking history for a given snapshot with the sequence assembly and metadata available at the time.

Clade_idNameGenomesClonal groupsTaxonomy
19988Staphylococcus aureus4182118species
19668Escherichia, Shigella2479986multiple
20829Mycobacterium tuberculosis184411species
19252Helicobacter pylori432258species
19672Enterobacter, Klebsiella384149multiple

Table 1.

Calculated clades may include a single species, a single genus, or multiple genera for closely related species.

2.3. Genome quality assessment

There are several criteria that are used to evaluate the quality of genome assembly.

The N50/L50 metrics are automatically calculated for each genome. Acceptable values are dependent on genome size, and genomes which do not meet the criteria are not processed for Refseq. For known clades, the genome size is expected to fall within 2 standard deviations from the mean for clades, which have at least 10 members. This standard allows for the identification of partial genomes and unusually large genomes, which may indicate a bad assembly or contamination.

Some genomes submitted to GenBank represent an assembly from a mixed culture (accession # AKNF01000000 is a mixed culture of Shigella flexneri1235-66 and an unknown Shigellaspecies) or a hybrid of different species or a chimera genome (accession # AP012495 chimera genome constructed by cloning the whole genome of Synechocystisstrain PCC6803 into the Bacillus subtilis168 genome). Partial and “anomalous” assemblies are clearly flagged in NCBI assembly database and not included in clade analysis.

2.4. Marker to genome alignment

Genome distance is defined as an average of pairwise protein distances of universally conserved single-copy proteins as defined in [8] (Table 2).

Genomic markers (E.coli K12 accessions)Genomic markers
NP_417801ribosomal protein S12
NP_417800ribosomal protein S7
NP_414564ribosomal protein S2
NP_418410ribosomal protein L11
NP_418411ribosomal protein L1
NP_417779ribosomal protein L3
NP_417774ribosomal protein L22
NP_417773ribosomal protein S3
NP_417769ribosomal protein L14
NP_417767ribosomal protein L5
NP_417765ribosomal protein S8
NP_417100ribosomal protein l6p/L9E
NP_417762ribosomal protein S5
NP_417757ribosomal protein S13
NP_417756ribosomal protein S11
NP_417698ribosomal protein L13
NP_417697ribosomal protein S9
NP_417634ribosomal protein S15P/S13E
NP_417770ribosomal protein S17
NP_417772ribosomal protein L16/L10E
NP_417760ribosomal protein L15
NP_417763ribosomal protein L18
NP_417755ribosomal protein S4

Table 2.

List of genomic markers used in genomic analysis. Escherichia coli K-12 accessions are given as an example. Each marker has a corresponding protein cluster which is used in the analysis.

2.5. Genome distance

Protein marker distances and genomic distance are designed to be robust while remaining appropriately sensitive. Protein distance measuring dissimilarity between markers of the same type is designed to ignore differences in protein lengths and tuned to measure dissimilarity in internal parts of the sequences. The subsequent genomic distance averages over the majority of marker-distances, ignoring the outliers.

2.5.1. Protein distances

Consider proteins i and j, having the best aggregated BLAST alignment of length Lijwith aggregated score Sij. Assume that the proteins have lengths Li and Ljand self-scores Siiand Sjj. Define normalized scores: sij=Sij/Lij, sii=Sii/Li, sjj=Sjj/Lj.

Then define protein distances:

dij=1min(1, sijmin(sii,  sjj))E1

Distance (1) is an identity-like characteristic calculated from the aggregated BLAST [17] scores (using positives based on BLOSUM62 matrix [22]). For full-length alignment, it can be reduced to 1Sijmin(Sii, Sjj).However, when lengths are different; distance (1) avoids penalizing nonaligned ends of the proteins, taking into account only mutation events.

2.5.2. Genomic distances

Suppose that genomes iand jhave Nijatypes of markers found in both genomes, with Nijhof them having acceptable BLAST hits.

Define the offset Δij=max(3, Nijh4, 1+NijaNijh).Order marker distances in the ascending order: dij(0)dij(1)...dij(Nijh1). Then robust genomic distance is defined by the formula:

Dij=p=Δijp=NijhΔij1 dij(p)lij(p)p=Δijp=NijhΔij1 lij(p),E2

where lij(p)are corresponding alignment length. The marker-protein distances are weighted by alignment lengths lij(p)in order to provide where possible results similar to the original method in [8] based on concatenation of proteins. However, the use of offset Δijallows filtering out outliers since the averaging in (2) is performed over Nijh2Δijdistances in the middle. For each phylum level group, an agglomerative hierarchical clustering tree is built using the complete linkage clustering algorithm [19, 20].

2.6. Genome clustering pipeline

The pipeline for calculating genome clades consists of three major components (see Figure 1). The first is the collection of the input data from NCBI main sequence repositories. The genomic data are dynamic: hundreds of new genomes and assembly updates are submitted to NCBI each day. We create a snapshot of all live genome assemblies and their nucleotide sequence components (chromosomes, scaffolds, and contigs) and store them in an internal relational database: UniCol. The genome data set is organized into large groups (phyla and superphyla defined by NCBI Taxonomy). The assemblies are then filtered by quality and passed to the processing script. Ribosomal protein markers are predicted in every genome to overcome problems with the genome annotations (missing and/or incorrect annotations) and to normalize markers’ data set. Marker predictions are performed by aligning reference protein markers against full genome assemblies. Assemblies with at least 17 markers are passed to the next step. Genome distance is calculated as an average of pairwise protein distances of markers shared in a pair of genomes. Finally, agglomerative hierarchical clustering trees are built within phylum-level groups. Clades at the species level are calculated using species-aware algorithm. Superclade trees are constructed by sectioning the trees at the distance of 0.25.

Figure 1.

Dataflow of ribosomal-marker-based clade (genome group) processing. Ribosomal markers (in green) are maintained outside of the main pipeline (in blue). Clades and markers are available on NCBI FTP site:

2.7. Clades and superclades

Due to biological, historical, and sampling reasons, microbial organisms have very different levels of strain variation within species. Using the genome data available in public archives we have calculated the diameter of the species defined by NCBI Taxonomy (see Figure 2).

Figure 2.

Distribution of Taxonomy-defined species diameter. Y axes – diameter of species, X axes – species numbered in the descending diameter order.

Instead of using one fixed threshold, we utilize a taxonomy-aware algorithm that allows increasing the size of a genomic group in certain circumstances. Two distance threshold, the lower threshold d_lower and the upper threshold d_upper, are established (currently, we use values d_lower = 0.015 and d_upper = 0.025). Genomes with the lowest common ancestor with height d_lower or below are always in the same group, while genomes with the lowest common ancestor with height above d_upper are never placed together. In between d_lower and d_upper, taxonomic information is used: two subgroups are merged in a larger group if any pair of species in a group is already together in one of two subgroups (i.e., there are no new merges of species). Species are defined according to the NCBI taxonomic records [16].

Figure 3.

Superclade tree for three abundant groups: A – Salmonella, B – Bacillus, C – Streptococcus. Green boxes represent clades; box size is proportional to the number of genome in a clade.

Phylum-level trees are not practical for presentation and evaluation of closely related genomes. However, it is important to see the relationships (distance) between close clades (see Figure 3).

2.8. Genome groups

Species-level clades are further refined by whole-genome alignments using megablast with default parameters [18]. The genome groups are defined by clustering the genomes at 95% identity and 90% coverage. An example of genome groups for Klebsiella pneumoniaclade is shown in Figure 4. For each group a representative genome with the highest level of assembly and annotation quality is selected.

Figure 4.

Klebsiella pneumoniaclade contains 534 full genome assemblies organized in 25 closely related genomic groups. Blue circles at the end of the branch represent a single genome; green boxes represent a group of genomes with the box size proportional to the number of genomes.


3. Results and discussion

Large clades obtain additional members in each subsequent snapshot (see Figure 5). The process assigns related genomes to the same clade consistently. There is also a large growth in singleton clades, reflecting an increasing interest in sequencing taxonomically distinct organisms.

Figure 5.

Clade growth in four sequential snapshots.

We have developed an infrastructure for grouping all whole-genome sequence assemblies at various proximity levels. By using universally conserved ribosomal genes we define the species-level groups. We propose a set of 23 single-copy marker gene families that have consistent evolutionary histories. The proposed ribosomal protein-marker distance and genomic distance are tailored to achieve robustness, while remaining appropriately sensitive.

The major objective of our approach is to generate and actively maintain the target sets for pan-genome analysis. These ribosomal-marker-based groups (clades) roughly correspond to the species level as defined by NCBI Taxonomy. The subclades are calculated to show the closeness of the groups at the higher level. The relationship within the species-level group is further refined with whole pairwise genome alignment performed by megablast [18]. Tight genomic groups are defined at the level of 95% identity over the 95% genome coverage. By using the representative genomes from the tight groups, we can reduce the redundancy in comparative genomic studies. Other targets can be used for more refined population variation studies within species or SNP analysis for pathogen outbreak detection. These target sets require more accurate distance measure such as whole genomic alignments, K-mer distance [21].

3.1. Clades and species

Using a taxonomy-aware clustering algorithm does not completely solve the discrepancies between the species-level clades and traditional species. Genome sequences provide great opportunity to refine the classical taxonomic description of prokaryotes [23]. All cases of discrepancy were manually evaluated; most of them have been resolved by literature support. Some examples are described below.

3.1.1. Different species merged into a single clade

Escherichia coliand some Shigellaspecies are combined in a single clade by ribosomal marker distance. Shigella, which is recognized as a genus with four species in most situations, taxonomically belongs to the diverse E. coligroup, but the genus-level distinction has been retained due to historical recognition of its medical significance. Shigellahas adapted to higher primates as the only natural hosts.

The genus Brucellaconsists of 10 classically recognized species [] based on antigenic/biochemical characteristics and primary host species: Brucella abortus(cattle); Brucella canis(dogs); Brucella ceti(marine mammals); Brucella inopinata; Brucella melitensis(sheep and goats); Brucella microti; Brucellaneotomae(rodents); Brucella ovis(sheep); Brucella pinnipedialis(marine mammals); Brucella suis(swine, cattle, rodents, wild ungulates), and recently described in [24] Brucella papionisisolated from baboons (Pappiospp.). The wave of Next-Generation Sequencing brought in almost a hundred new isolates from a population of Brucella, which are clearly distinct from currently recognized species that are tentatively designated at the species level. These unnamed isolates have not yet been characterized using traditional methods, or the species name has not yet been validly published. Brucellagenus–level clade is shown in Figure 6.

Figure 6.

Ribosomal-marker-based clade comprises various species of Brucella. The pairwise genome distance is defined by the number of shared proteins in the core set of Brucella pan-genome. Green dots – proteins present in CORE set; red dots – proteins absent in CORE set.

Single species represented by multiple clades

Prochlorococcusand marine Synechococcusorganisms are small marine cyanobacteria, their genomes are characterized by small size and an evolutionary trend toward low GC content [25]. Whereas many shared derived characters define Prochlorococcusas a clade, many genome-based analyses recover them as paraphyletic. The single species, Prochlorococcus marinus, comprises six named ecotypes. Our ribosomal marker analysis and whole-genome alignment (described above in section on Methods) analysis suggests that this species should be represented by 11 different clades (see Figure 7.) These results are supported by recent genomic analysis of the genus of Prochlorococcus[26].

Figure 7.

Prochlorococcus marinusinterspecies diversity. The dendrogram is calculated using blast genome alignment score (%identity). The leaf nodes displayed as circles represent genomes of individual isolates/strains.

Novel species from noncultured not-isolated single cell and metagenome assemblies and new unclassified isolates (<genus> sp.) from clinical and epidemiological studies can be organized in hierarchical groups by genome sequence comparison methods. These groups can be used for downstream analysis: 1) pan-genome by clades not species; 2) groups of closely related genomes below species that can be calculated by nucleotide whole-genome comparison like K-mer or BLAST; 3) classification validation; 4) visualization of large data sets by selecting the genome representatives. Some of the applications marker-based clades and tight genome groups have been previously briefly described in [27,28].


4. Conclusions

No matter how impressive the numbers of genome sequencing projects are, they represent a miniscule fraction of the total number of bacterial species. The future genomic analysis tools will have to take into consideration the uncertain origin of the DNA sequences during analysis. Making sense of genomic data is one of the goals that are aided by the genome clustering procedure. The hierarchical infrastructure provides the foundation for further development of genome analysis and visualization tools.



This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. The authors are thankful to David J. Lipman, James Ostell, Scott Federhen, Michael Galperin, Eugene Koonin, Yury I. Wolf for productive discussions and to Vyacheslav Brover for the database support.

© 2016 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Leonid Zaslavsky, Stacy Ciufo, Boris Fedorov, Boris Kiryutin, Igor Tolstoy and Tatiana Tatusova (January 14th 2016). Dealing with the Data Deluge – New Strategies in Prokaryotic Genome Analysis, Next Generation Sequencing - Advances, Applications and Challenges, Jerzy K Kulski, IntechOpen, DOI: 10.5772/62125. Available from:

chapter statistics

1776total chapter downloads

1Crossref citations

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Reaping the Benefits of Next-generation Sequencing Technologies for Crop Improvement — Solanaceae

By Sushil Satish Chhapekar, Rashmi Gaur, Ajay Kumar and Nirala Ramchiary

Related Book

First chapter

Areas of Endemism: Methodological and Applied Biogeographic Contributions from South America

By Dra Dolores Casagranda and Dra Mercedes Lizarralde de Grosso

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us