Metagenomics-Based Phylogeny and Phylogenomic

Phylogenetic relationships among microbial taxa in natural environments provide key insights into the mechanisms that shape community structure and function. In this chapter, we address current methodologies for community structure profiling, using single-copy markers and the small subunit rRNA gene to measure phylogenetic diversity from next-generation sequencing data. Furthermore, the huge amount of data from metagenomics studies across the world has allowed us to assemble thousands of draft genomes, making it necessary to compare whole-genome composites through phylogenomic approaches. Several computational tools are available to carry out these analyses with considerable success; we present a compendium of open-source tools that are easy to use and have modest hardware requirements, so that non-specialist biologists can apply them to study microbial diversity in a phylogenetic context.


Introduction
Next-generation sequencing technologies have transformed our perception of microbial diversity and distribution in natural ecosystems and have contributed substantially to the discovery of entirely new microbial landscapes in environments as distinctive as the mammalian gut, the plant rhizosphere, the vascular tissues of higher plants, and even volcanic lakes [1][2][3]. There are two general approaches to profiling microbial communities through next-generation sequencing: shotgun sequencing of total DNA isolated directly from the environment and sequencing of variable regions of SSU-rRNA genes. Both are known as metagenomics methods, since both involve the culture-independent genomic analysis of the microbiomes of a particular environment [4,5]. Both approaches have been widely used to trace microbial diversity at increasingly fine taxonomic levels, either by capturing a representative fraction of the total gene content or by amplicon sequencing of markers such as the popular bacterial 16S rRNA gene. Each method has advantages and disadvantages, and the choice depends on several factors, including taxonomic resolution, cost, sensitivity, and primer bias.
One of the challenges associated with metagenomics methods is the analysis of the massive amount of data generated. Sequencing of both amplicons and environmental DNA produces millions of short DNA sequences (reads), which must undergo preprocessing and quality control before biologically useful information can be extracted from them. One of the goals of analyzing massive sequencing data is to obtain the patterns of phylogenetic diversity in ecological communities, an important trait for addressing the classic ecological questions "Who is there?" and "What are they doing?" and for gaining a better understanding of the phylogenetic relationships among the taxa of a microbial community.
Extracting phylogenetic information from massive sequencing reads is not a trivial task; however, it can be achieved with reasonable success by using profiling tools adapted both to the analysis of ribosomal gene amplicons and to genes conserved across different domains [6,7]. Microbial community structure has been approached mostly using the 16S SSU-rRNA gene as a phylogenetic marker, mainly because of lower sequencing costs and an acceptable balance between specificity and resolution in taxonomic assignments [8]. Methods that use single-copy markers obtained from shotgun sequencing reads or assembled samples are gaining relevance because they have demonstrated strain-level resolution [9,10], which is particularly difficult to achieve when analyzing complex microbiomes.
To date, several computational tools have been developed to carry out community profiling and phylogenetic inference from next-generation sequencing data with considerable success. In this chapter we present a compendium of open-source tools that are easy to use and have modest hardware requirements, so that non-specialist biologists can apply them to study microbial diversity in a phylogenetic context. We show several practical examples explained step by step, so that readers can replicate them using their own data.
We have selected tools that run on a local computer through the Unix command line, as well as tools available from dedicated servers that offer easy access and intuitive use. The examples described in the chapter were tested on a Dell Optiplex 7010 desktop, 6T6ZYV1 Series, Intel (R) Core (TM) i5-3550 CPU at 3.30 GHz, Memory 12 GiB.

Community structure profiling across microbial samples using single-copy markers
With the advent of massive DNA sequencing technologies, several methods have been developed to assign shotgun reads to microbial taxonomic categories. These methods aim to profile a microbial community by inferring its relative structure, and they are very important for understanding how microbiomes work in nature, their phylogenetic composition, and even their dynamics and evolutionary history. The starting point for these analyses is a set of reads obtained by massive sequencing, whose length varies (from as little as 50-75 bp up to >1000 bp) depending on the platform used (Illumina, Ion Torrent, PacBio RS). A read is the sequence of bases from a single discrete molecule of DNA, obtained in a massively parallel manner [11]. Currently, however, most metagenomics studies use short-read sequencing instruments that produce reads between 100 and 600 bp in order to maximize read counts and lower costs. These short reads contain the genomic, phylogenetic, and functional information of the microbiome in millions of discrete DNA fragments, which are sufficient to make a reliable estimate of the phylogenetic diversity present in a microbial sample (Figure 1).
The taxonomic composition of a microbial community can be estimated from a set of short reads by assigning each read to the most likely microbial lineage [12]. Historically, a single-gene target approach based on the 16S ribosomal RNA gene has been the gold standard for assigning taxonomy in prokaryotes. However, this approach presents important biases related to copy-number variation and significant intraspecific differences (~6%). For this reason, both clade-specific and universal single-copy phylogenetic marker genes have gained popularity in the scientific community: they are not subject to intragenomic diversity, are rarely subject to horizontal transfer, and have proven robust for delineating prokaryotic species and strains in multiple studies, because several genes can be combined to reconstruct phylogenies [13,14]. Although each method selects its own set of clade-specific or universal markers, most of these genes encode proteins with functional relevance in housekeeping metabolism (Table 1). For the analysis, the coding nucleotide sequences are generally used, as they offer better resolution than amino acid sequences in closely related organisms [16]. This simplifies the computational analysis, because the short reads can be compared unambiguously without translating them into proteins, a step that could generate artifacts given the small size of the reads.
One of the most popular tools for microbial profiling based on clade-specific marker genes is the MetaPhlAn classifier [12,17]. MetaPhlAn maps the experimental reads against a collection of 231 markers for species-level comparisons and >115,000 markers for higher taxonomic levels. Among the advantages of this classifier is that no preprocessing is required, so raw data can be uploaded and analyzed. The main disadvantage for non-specialists is that MetaPhlAn works through the command line in a Unix architecture.
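The principle of estimating clade abundances from reads mapped to clade-specific markers can be illustrated with a minimal Python sketch. This is not MetaPhlAn's implementation: the marker catalogue, clade names, and read counts below are invented for illustration, and the normalization (reads per marker base, averaged per clade, then scaled so that all clades sum to 1) is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical marker catalogue: marker id -> (clade, marker length in bp).
MARKERS = {
    "m1": ("Clade_A", 1000),
    "m2": ("Clade_A", 500),
    "m3": ("Clade_B", 2000),
}

# Hypothetical number of reads mapped to each marker.
READ_COUNTS = {"m1": 200, "m2": 100, "m3": 100}


def relative_abundance(markers, read_counts):
    """Return clade -> relative abundance from per-marker read counts."""
    per_clade = defaultdict(list)
    for marker_id, (clade, length) in markers.items():
        # Normalize by marker length so long markers do not dominate.
        coverage = read_counts.get(marker_id, 0) / length
        per_clade[clade].append(coverage)
    # Average the coverage of all markers belonging to the same clade.
    mean_cov = {c: sum(v) / len(v) for c, v in per_clade.items()}
    total = sum(mean_cov.values())
    if total == 0:
        return {c: 0.0 for c in mean_cov}
    return {c: cov / total for c, cov in mean_cov.items()}


if __name__ == "__main__":
    result = relative_abundance(MARKERS, READ_COUNTS)
    for clade, fraction in sorted(result.items()):
        print(f"{clade}\t{fraction:.3f}")
```

Normalizing by marker length before averaging is the design choice that matters here: without it, a clade with longer markers would appear more abundant simply because it recruits more reads.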

Profiling a textile dye degrader microbiome with MetaPhlAn2
Next we describe the steps performed to profile a microbial community capable of degrading the textile dye HC Blue no. 2. We also show a graphical representation of the phylogenetic profiling metadata. This general strategy can be applied to profile any microbial community from short reads obtained by massive sequencing. Symbol convention: Comments (#); executable commands ($). The raw data are available at [18].

[Table 1. Columns: Clusters of orthologous groups of proteins (COG); Protein name.]

Phylogenetic diversity of microbial communities based on the 16S rDNA gene
Estimating the taxonomic and phylogenetic diversity of a microbial community is also possible through sequencing and analysis of the small subunit ribosomal RNA (16S rRNA) gene, since this sequence has long been considered a stable marker and has been crucial in microbial systematics over the last 30 years. 16S ribosomal ribonucleic acid is a key component of the small subunit of prokaryotic ribosomes and a central player in the cellular biology of microorganisms; it serves as a linker in the process of translating genetic information into proteins [20]. Because DNA is much easier to sequence than RNA, the DNA segment coding for 16S rRNA is obtained for sequencing purposes (Figure 3). This gene has several features that have made it a "quasi-gold standard" for bacterial taxonomy:
• It is a ubiquitous gene in the Bacteria and Archaea domains.
• Within its ~1500 bp, it has discrete regions with enough variability to establish a phylogenetic signal among phyla and even genera.
• It has conserved regions that allow the design of "universal primers," a very useful feature in massive sequencing.
• It has several databases enriched with sequences from almost all international projects where 16S sequences are obtained (

16S community profiling by analysis of ribosomal amplicons
Microbial diversity is measured as a function of the richness and abundance of distinct taxa in a community [25]. Obtaining representative DNA sequences from the entire community is essential to make valid inferences. Profiling a microbial community through 16S gene analysis generally consists of four steps (Figure 4). To date, several computational tools have been developed to analyze microbial communities through the 16S gene marker; however, estimating the total microbial diversity in any environment is still a major challenge [6,26,27,28]. Among the factors that contribute to this difficulty, we highlight two: (I) processing huge amounts of data pushes the limits of modern computing and (II) the analysis requires expertise that can take years of training to acquire. Fortunately, many tools have been developed in recent years with the aim of making bioinformatics platforms dedicated to this type of analysis more user-friendly, and there are sites dedicated to hosting computational alternatives for almost all needs, for example, https://github.com/.
A good example of these platforms for profiling microbial communities is the Microbiome Taxonomic Profiling (MTP) pipeline from the EzBioCloud site (https://www.ezbiocloud.net/contents/16smtp) [24]. Its main advantages are that it is free, that no knowledge of the Linux environment is needed to carry out the analyses, and that several types of output are fully available, such as functional profiles, taxonomic and phylogenetic structure, and on-demand comparison with other published microbiome data. New users of EzBioCloud are required to create an account (https://www.ezbiocloud.net/signup?from=addMTP); after that, they can upload up to 100,000 reads per sample and begin the analysis. We list the general steps to perform a profiling on the platform (Box 1).
The platform has a very intuitive and user-friendly interface that guides the beginner at every stage of the analysis. The first step is to upload the next-generation sequencing data (16S amplicon reads). After that, you can request the MTP pipeline, and the analysis starts. In a relatively short time, you can access the results portal, where the preprocessing results are summarized as pre-filtered reads (after removal of low-quality and chimeric amplicons), statistics on read lengths, and taxonomic read assignments at the species level.
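To give an idea of what a read pre-filtering step involves, the following Python sketch discards reads that are too short or whose mean Phred quality is too low, and reports simple read-length statistics. It is only an illustration and does not reproduce the MTP pipeline: the thresholds, the Phred+33 encoding, and the function names are assumptions, and chimera removal is not covered.

```python
import io
from statistics import mean


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file (or a blank line) stops the parser
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return mean(ord(ch) - offset for ch in quality)


def filter_reads(records, min_mean_q=25, min_length=100):
    """Keep reads passing both the length and mean-quality thresholds."""
    return [
        (read_id, seq)
        for read_id, seq, qual in records
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_q
    ]


if __name__ == "__main__":
    # Two toy reads: 'I' encodes Phred 40, '#' encodes Phred 2.
    demo = io.StringIO("@good\nACGTACGT\n+\nIIIIIIII\n@bad\nACGTACGT\n+\n########\n")
    kept = filter_reads(parse_fastq(demo), min_mean_q=25, min_length=4)
    lengths = [len(seq) for _, seq in kept]
    print(f"reads kept: {len(kept)}; mean length: {mean(lengths):.1f}")
```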
Other outputs in the results portal include several diversity indices, taxonomic composition and hierarchy, and graphical representations such as Krona [29]. MTP implements seven different diversity indices; among them is the phylogenetic diversity index, a measure of biodiversity that considers the phylogenetic differences between taxa and weighs several variables such as taxonomic diversity and species abundances or distributions.
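Two of the ideas behind such indices can be sketched in a few lines of Python: an abundance-based index (Shannon) and a tree-based one (Faith's phylogenetic diversity, computed as the sum of the branch lengths connecting the observed taxa to the root). We assume here that the phylogenetic diversity index follows Faith's classic unweighted definition; the platform's exact formula may differ. The toy tree, taxon names, and counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with count > 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical rooted tree: node -> (parent, branch length to parent).
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 0.5),
    "C": ("root", 2.0),
}


def faith_pd(tree, observed_taxa):
    """Sum the branch lengths on the paths from observed taxa to the root."""
    visited = set()
    total = 0.0
    for taxon in observed_taxa:
        node = taxon
        # Walk towards the root, counting each branch only once.
        while node in tree and node not in visited:
            parent, length = tree[node]
            total += length
            visited.add(node)
            node = parent
    return total


if __name__ == "__main__":
    print(f"Shannon (even community): {shannon_index([50, 50]):.4f}")
    print(f"Shannon (uneven community): {shannon_index([99, 1]):.4f}")
    print(f"Faith's PD for A and B: {faith_pd(TREE, ['A', 'B'])}")
    print(f"Faith's PD for A and C: {faith_pd(TREE, ['A', 'C'])}")
```

In this toy tree, the sample containing A and C has a higher phylogenetic diversity than the sample containing the two sister taxa A and B, even though both samples have the same richness.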

Extracting 16S sequences from assembled data
On occasion, we do not have a set of short DNA reads but rather sequences assembled into contiguous regions of variable size. Such is the case for genomes assembled from metagenomes or contigs from complex metagenomes. Inferring taxonomic diversity from this type of data usually requires other strategies. One of the most useful is to predict all the rRNA sequences contained in the assembly and cluster them according to their identity (which implies building a list of nonredundant sequences) to define operational taxonomic units. A simple way to address this problem is to use the Barrnap software [27]; it works through the Unix command line and has the advantage of consuming few computational resources, so that rRNA sequences can be extracted from several complex microbiomes on a personal computer. Barrnap produces an output with all predicted sequences; for bacteria, this includes 5S, 16S, and 23S rRNA. The sequences can be saved on demand in a text file and subsequently analyzed with third-party phylogenetic software to establish evolutionary relationships between taxa. A suitable platform for this purpose is SeaView [28], which contains sequence alignment and curation utilities, as well as a set of phylogenetic reconstruction methods such as PhyML, which uses maximum likelihood algorithms and seven different evolutionary models. It is also possible to use distance methods such as Neighbor Joining and BioNeighbor Joining, both with seven different methods for calculating distances between sequences. The platform is open access and has the advantage of being a very intuitive graphical application that works on Unix and Windows.
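The "predict, extract, and cluster" strategy can be sketched in Python as follows. The sketch assumes that the rRNA predictions are available as GFF3 lines whose attribute column mentions "16S" (the layout we expect from Barrnap, which should be checked against the actual output), and it uses a crude similarity ratio from the standard library in place of a real alignment-based identity. The 97% threshold, function names, and toy data are illustrative assumptions; real analyses should rely on dedicated clustering tools.

```python
from difflib import SequenceMatcher

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_16s(gff_lines, contigs):
    """Extract 16S sequences from contigs using GFF3 feature coordinates."""
    sequences = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip GFF3 headers and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or "16S" not in cols[8]:
            continue  # keep only features annotated as 16S rRNA
        contig, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        sub = contigs[contig][start - 1:end]  # GFF3 is 1-based, inclusive
        sequences.append(reverse_complement(sub) if strand == "-" else sub)
    return sequences


def identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_otus(sequences, threshold=0.97):
    """Greedy clustering: each cluster's first member is its representative."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no match: seq founds a new OTU
    return clusters


if __name__ == "__main__":
    contigs = {"c1": "TTTT" + "ACGTACGTAC" + "GGGG"}
    gff = ["##gff-version 3",
           "c1\tbarrnap\trRNA\t5\t14\t0\t+\t.\tName=16S_rRNA;product=16S ribosomal RNA"]
    seqs = extract_16s(gff, contigs)
    print(seqs, len(greedy_otus(seqs + ["TTTTTTTTTT"])))
```

The representatives of the resulting clusters form the nonredundant list that can then be aligned and used for tree reconstruction.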

Open-source software for phylogenetic and phylogenomic surveys
Genome-based comparisons play an essential role in the current taxonomy and phylogenetics of the Bacteria and Archaea domains and will eventually replace the single-gene target approach dominated by 16S rRNA gene phylogeny. The exponential growth in international databases of complete genomes and draft genomes with high completeness values and low contamination (<5%) has led to an approach to phylogenetic analysis in which the whole-genome information serves as a more conservative fingerprint of the taxonomic categories. The current challenges involve improving existing methods for data acquisition and processing, since comparative analysis, even among modest-sized microbial genomes, can be computationally expensive. Here we present a list of open-source tools that are easy to use and have modest hardware requirements, so that biologists can apply them to study microbial diversity in a phylogenetic context (Table 4).

Conclusions
Profiling microbial communities from massive sequencing data constitutes a turning point in the understanding of population structure and dynamics, their ecological functions, and the complex relationships established between noncultivable microorganisms. Through technological developments such as next-generation sequencing and the development of hundreds of open-access platforms, we have been able to better understand the role of the microbial world in natural ecosystems. This chapter aims to bring computational biology tools to professionals in the biological sciences who have different levels of expertise and are interested in metagenomics analysis. We started with the basics of microbial community profiling from shotgun sequencing data and its processing using the MetaPhlAn software (readers may find other tools better suited to their conditions; an interesting option is the FOCUS software, which works through a Web server). MetaPhlAn has the advantage of being fully integrated with the GraPhlAn phylogenetic reconstruction tools. We dedicated a complete section to 16S gene-based community profiling, illustrated with the EzBioCloud platform, a useful tool for obtaining ecological and phylogenetic information on microbiomes. An alternative approach for assembled data is the Barrnap software, which extracts ribosomal sequences from assemblies quickly and efficiently; these sequences can subsequently be clustered and processed with phylogenetic reconstruction tools such as SeaView. Finally, we presented a list of software that can serve as a guide for the analysis of microbiomes, from their taxonomic characterization to the study of phylogenetic relationships between taxa.