Summary of the five testing datasets.
Scaffolding is an important step of the genome assembly and its function is to order and orient the contigs in the assembly of a draft genome into larger scaffolds. Several single reference-based scaffolders have currently been proposed. However, a single reference genome may not be sufficient alone for a scaffolder to correctly scaffold a target draft genome, especially when the target genome and the reference genome have distant evolutionary relationship or some rearrangements. This motivates researchers to develop the so-called multiple reference-based scaffolders that can utilize multiple reference genomes, which may provide different but complementary types of scaffolding information, to scaffold the target draft genome. In this chapter, we will review some of the state-of-the-art multiple reference-based scaffolders, such as Ragout, MeDuSa and Multi-CAR, and give a complete introduction to Multi-CSAR, an improved extension of Multi-CAR.
- multiple reference genomes
Due to recent advances in next-generation sequencing (NGS) technologies, more and more genomes of organisms can be sequenced quickly at a moderate cost . However, assembling a large number of reads generated from current NGS sequencing platforms into a complete genome still is a challenging job . Largely because of repetitive sequences, whose lengths are often larger than those of the reads, most of assembled sequences are just draft genomes that usually consists of several hundreds or even thousands of contigs (contiguous sequences). The availability of complete genomes actually is significant to the downstream analysis and interpretation of their sequences in many biological applications . To further obtain more complete sequences of draft genomes, therefore, the contigs of the draft genomes usually are required to be ordered and oriented into scaffolds, which actually are larger gap-containing sequences whose gaps between the scaffolded contigs can be closed later in the gap-filling process .
The scaffolding process utilizes a genomic sequence available from a related organism to serve as a reference to scaffold the contigs of a draft genome. So far, many such reference-based scaffolders have been proposed [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. The algorithms used to develop all these scaffolders can be classified into two main categories: the alignment-based algorithms [5, 6, 7, 8, 9, 10] and the rearrangement-based algorithms [11, 12, 13, 14]. The alignment-based scaffolding algorithms first align contigs in a target draft genome against a reference sequence and then scaffold the contigs according to the positions of their matches in the reference. On the other hand, the rearrangement-based scaffolding algorithms utilize the concept of genome rearrangements to scaffold the contigs of the target draft genome such that the sequence markers (or genes) shared between the scaffolded target and reference genomes have similar order and orientation as much as possible.
In some cases, it may be insufficient for a scaffolder to utilize only one single genome as the reference for correctly computing the scaffolds of a target draft genome, in particular when the target and reference genomes have a distant phylogenetic relationship or they have undergone some kinds of rearrangements, such as reversals, transpositions, block-interchanges and translocations. This situation inspires the requirement for developing multiple reference-based scaffolders, expecting that they can refer to several different but complementary genomes to order and orient the contigs of the target genome.
2. State-of-the-art multiple reference-based scaffolders
Ragout (Reference-Assisted Genome Ordering UTility) is a rearrangement-based scaffolder for ordering and orienting the contigs of a draft genome using multiple reference genomes . The input of Ragout includes a target draft genome, multiple reference genomes, and a phylogenetic tree between them. Ragout uses different colors to display the target and reference genomes and further represents all of these genomes as sequences of synteny blocks. Ragout then creates a so-called incomplete multi-color breakpoint graph, in which vertices represent the ends of synteny blocks and edges denote adjacencies of two synteny blocks occurring in the target and reference genomes. For the purpose of distinction, the edges are also colored by Ragout using the colors of the corresponding genomes. Because the target genome is already fragmented into contigs, some adjacencies of synteny blocks in the target genome are missing. Ragout tries to recover these missing adjacencies by using other existing adjacencies from the reference genomes. In the recovery process, Ragout computes the parsimony costs of all possible missing adjacencies by solving a so-called half-breakpoint state parsimony problem on the given phylogenetic tree, which actually is an NP-hard (non-deterministic polynomial time-hard) problem, meaning that it is hard to compute its optimal solution in polynomial time. Therefore, a heuristic approach is applied by Ragout to calculate the approximate parsimony costs of all the missing adjacencies. A perfect matching with minimum cost is then computed by Ragout on a graph created by using the missing adjacencies and is further used to scaffold the contigs of the target genome. Actually, the above procedure is repeated by Ragout multiple times with using different sizes of synteny blocks and moreover the scaffolding results obtained from all these iterations are then combined into a single set of scaffolds. Finally, a refinement is performed by Ragout to insert a number of small but repetitive contigs back to the resulting scaffolds.
MeDuSa (Multi-Draft based Scaffolder) is a multiple reference-based scaffolder that does not require a given phylogenetic tree for the target and references genomes . From the given target and reference genomes, MeDuSa constructs a so-called scaffolding graph, which denotes by vertices the contigs of the target genome and by edges the adjacencies between any two contigs when they can be mapped to the reference genomes. Moreover, each edge in the scaffolding graph is associated with a weight to represent the number of reference genomes supporting the existence of the edge. As a result, it is not hard to see that a path cover, which is a set vertex-disjoint paths covering all the vertices of the scaffolding graph, denotes a set of scaffolds in the target genome. Unfortunately, however, finding a path cover of maximum weight in a graph is already known as an NP-hard problem. Therefore, MeDuSa utilizes a 2-approximation algorithm to find an approximate path cover from the scaffolding graph. Finally, MeDuSa applies a majority rule to determine the orientations of contigs on each path of the approximate path cover.
Multi-CAR (Multiple reference-based Contig Assembly using Rearrangements) is multiple-reference version of CAR (Contig Assembly using Rearrangements) . CAR actually is a single reference-based scaffolder that utilizes a complete reference genome to scaffold the contigs of a target draft genome . Like MeDuSa, Multi-CAR does not require prior knowledge concerning phylogenetic relationships among target and reference genomes. However, in contrast to Ragout and MeDuSa, both attempting to solve an NP-hard problem in their scaffolding processes, the algorithm behind Multi-CAR involves only polynomially solvable problems, as described as follows. First, Multi-CAR utilizes CAR to compute a single reference-derived scaffolding result for a target draft genome based on each of multiple reference genomes. Second, Multi-CAR uses all single reference-derived scaffolds to build an edge-weighted contig adjacency graph. In this contig adjacency graph, the vertices denote extremities of contigs (i.e., each contig is represented by two vertices) and the edges represent whether two contigs are ordered consecutively in a scaffold returned by CAR based on a single reference genome (if so, the adjacent extremities of these two contigs are connected by an edge). In addition, if there are multiple reference genomes to support an edge connection, then this edge will be assigned a weight that equals to the sum of the weights of the supporting reference genomes. The weight of each reference genome is given by the users in advance; otherwise, it is defaulted to one. Third, Multi-CAR continues to find a maximum weighted perfect matching from the contig adjacency graph. Finally, Multi-CAR constructs a multiple reference-derived scaffold for the target draft genome according to the maximum weighted perfect matching.
3. A recent multiple reference-based scaffolder
In this section, we give a detailed introduction to a recent multiple reference-based scaffolder, called Multi-CSAR (Multiple reference-based Contig Scaffolder using Algebraic Rearrangements), which is an improved extension of Multi-CAR . Unlike Ragout and MeDuSa, Multi-CAR actually can not accept incomplete genomes as references, which greatly limits the widespread adoption of Multi-CAR because complete reference genomes are not always available for a target draft genome in practical usage . In addition, the weight of all reference genomes used by Multi-CAR must be assigned by the users; otherwise, they are defaulted to one. However, it is usually not easy for the ordinary users to correctly determine these weights. Therefore, Multi-CSAR has been developed to further overcome these limitations of Multi-CAR. In principle, the main steps of the algorithm in Multi-CSAR is the same as those in Multi-CAR, except that Multi-CSAR utilizes CSAR , instead of CAR , to compute the single reference-derived scaffolding result for the target draft genome, and also designs a sequence identity-based weighting scheme to automatically derive the weights of all the reference genomes. CSAR actually is an improved version of CAR and their main difference in usage is that the reference genome used by CAR needs to be complete, but the one used by CSAR can be incomplete.
3.1 Algorithm of multi-CSAR
Suppose that denotes a target draft genome with contigs and denote reference genomes with weights , respectively. Contigs actually are fragmented linear DNA sequences with two extremities, called head and tail, respectively. Multi-CSAR performs the following steps to scaffold the contigs in the target genome using the multiple reference genomes . First, Multi-CSAR utilizes CSAR to obtain a single reference-derived scaffold of based on each , where . Second, Multi-CSAR constructs a contig adjacency graphsuch that there are two vertices and for representing the head and tail of each contig , respectively, and there also is an edge for linking any two vertices if they are the extremities coming from the different contigs. An edge in is said to be supported by a reference genome if its two vertices are adjacent extremities from two distinct but continuous contigs in scaffold . If an edge in is supported by several reference genomes at the same time, then this edge receives a weight equal to the sum of the weights of all these supporting reference genomes. However, if an edge in is not supported by any reference genome, then it has a weight of zero. Third, Multi-CSAR utilizes the Blossom V  to find a maximum weighted perfect matching in , where a subset of edges in is called a perfect matching if every vertex in is incident to exactly one edge in this subset. Let and denote a subset of (i.e., ) with the minimum weight such that there is no cycle in . Finally, Multi-CSAR makes use of the edge connections in to scaffold the contigs of . Figure 1 displays an example for illustrating how the algorithm of Multi-CSAR works.
Note that CSAR was developed based on on a near-linear time algorithm  and Blossom V based on an -time algorithm , where is the number of vertices in a graph. Therefore, all the steps in the Multi-CSAR algorithm described previously can be implemented in polynomial time. In addition, Multi-CSAR utilizes the following sequence identity-based weighting scheme to automatically compute the weights of the reference genomes. First, Multi-CSAR applies either NUCmer or PROmer for identifying those sequence markers that actually are aligned regions between the target genome and each reference genome , where . Note that both NUCmer and PROmer come from the MUMmer package . The main difference between NUCmer and PROmer is that the former finds the sequence markers directly on input DNA sequences, while the latter recognizes them on the six-frame protein translation of the input DNA sequences. Suppose that there are sequence markers, say , between and , and and are used to denote the alignment length of each and its percent identity, respectively. Next, Multi-CSAR calculates the weight of each reference genome by the formula . The principle of the sequence identity-based weighting scheme is that the more similar the reference genome is to the target genome , the more weight receives.
3.2 Usage of multi-CSAR
Currently, Multi-CSAR offers a web server2 with an easy-to-operate interface (see Figure 2) to the users. To run Multi-CSAR, the users first need to upload a target genome and one or more reference genomes in multi-FASTA format. If needed, the users can click the “plus” (respectively, “minus”) button to add (respectively, remove) a reference genome field. Second, the users can determine whether or not to utilize the sequence identity-based weighting scheme provided by Multi-CSAR for automatically calculating the weights of reference genomes. If the weighting scheme is not used, then the weights of all the reference genomes are defaulted to one. Third, the users can choose either NUCmer or PROmer to identify sequence markers between the target genome and each of the reference genomes. Fourth, the users can enter an email address, which is optional, if they would like to run Multi-CSAR in a batch way. When running Multi-CSAR in this batch way, the users will be notified of the scaffolding result via email when the submitted job is finished by the web server of Multi-CSAR.
Multi-CSAR outputs its scaffolding results in four tab pages: (a) input data & parameters, (b) Circos plot validation, (c) dotplot validation, and (d) scaffolds of target. In the “Input data & parameters” page (see Figure 3 for an example), Multi-CSAR simply shows the information of the input target and reference genomes, the user-specified program (either NUCmer or PROmer) for identifying their sequence markers, and whether the weighting scheme of reference genomes is used or not. By clicking on the links of the target and reference genomes in this page, Multi-CSAR will also display their input DNA sequences. By clicking on the link “Dotplot against target genome” on the reference genomes, Multi-CSAR will display a dotplot that allows the users to visually inspect sequence markers shared between un-scaffolded target genome and a reference genome. In the dotplot (see Figure 4 for an instance), the un-scaffolded target genome and a selected reference genome are represented on the and axes, respectively. Note that the contigs and scaffolds in the dotplot are separated by horizontal and vertical dashed lines. Moreover, each forward (respectively, reverse) sequence marker is shown by a red (respectively, blue) line and its begin and end are represented by two unfilled circles. The users can sort the contigs of the input target genome based on their sizes by clicking on the toggle switch “Sort by contig size”. The users also can show or hide the IDs of contigs and scaffolds used in Multi-CSAR by using the toggle switch “Show scaffold/contig IDs.” The format of contig (respectively, scaffold) IDs begins with three-letter prefix CTG (respectively, SCF) followed by an underscore (_) and at least one digit (e.g., CTG_1 and SCF_1). In addition, the users can click the “Save as SVG file” button to download a copy of the dotplot in scalable vector graphics (SVG) format.
In the “Circos plot validation” page, (see Figure 5 for an example), Multi-CSAR displays its total running time, as well as its scaffolding result by a Circos plot between scaffolded target genome and all reference genomes. In the initial Circos plot, the scaffolds of target genome (displayed in purple) and all the reference genomes (displayed in other colors) are arranged in a circle with the inner links connecting corresponding sequence markers between the target genome and each of reference genomes. The color of an inner link comes from the reference genome it connects. In the Circos plot, the number of crossing inner links can be viewed as a accuracy measure for a scaffolding result. That is, if the contigs of the target genome are scaffolded well according to a reference genome, the number of crossing inner links between them should be low. For this purpose, Multi-CSAR allows the users to select any reference genome (by clicking the checkbox next to it) from the top of the tab page to display (by clicking the “Display Circos plot” button) its Circos plot against the scaffolded target genome (see Figure 6 for an instance). In this Circos plot, the inner circle displays the sequence markers shared between the target genome and the selected reference genome. As demonstrated in Figure 6, the Circos plots of the scaffolding result are convenient and helpful for the users to visually validate whether the contigs of the target genome are properly scaffolded according to the reference genomes, as well as to visually identify whether there are any genome rearrangements between the scaffolded target and reference genomes. In addition, Multi-CSAR allows the users to the Circos plots of the scaffolds in portable network graphics (PNG) format by clicking the “Save as PNG file” button.
In the “Dotplot validation” page (see Figure 7 for an example), Multi-CSAR displays its its scaffolding result by a dotplot between the scaffolded target genome and a selected reference genome (the default is the first reference genome). In fact, the matched sequence regions of sequence markers should be displayed from the bottom left to the top right in the dotplot (as shown in Figure 7) or from the top left to the bottom right, if the contigs from the target genome are scaffolded perfectly based on the selected reference genome. Showing the scaffolding result in the dotplot display is another way to conveniently help the users to visually verify whether the contigs of the target genome are scaffolded properly based on the reference genomes or not. The users can click the “Save as PNG file” button to download the dotplot of a scaffold in portable network graphics (PNG) format.
In the “Scaffolds of target” page (see Figure 8 for an instance), Multi-CSAR displays its scaffolding result in tabular format for the purpose of allowing the users to view the scaffolds of the target genome in detail. The scaffolds in the table are sorted according to their sizes, which equals to the sum of contig sizes. In each scaffold, the ordered contigs, as well as their orientations (forward orientation denoted by 0 and reverse orientation by 1), sequences and lengths, are listed in a table. The users can click on the “Download scaffolds (.txt)” and “Download scaffolds (.csv)” buttons to download the scaffolds of the target genome in the tab-delimited text format and comma-delimited CSV format, respectively. In addition, the users can click on the “Download sequences” button to download the scaffold sequences in the text format, in which the sequences of contigs are separated by 100 Ns if they belong to the same scaffold.
4. Results and discussion
4.1 Testing datasets
The three multiple reference-based scaffolders Multi-CSAR, Ragout (version 1.0) and MeDuSa (version 1.6), we introduced in this chapter, were tested on a benchmark of five real bacterial datasets as shown in Table 1. In fact, these five testing datasets were originally prepared by Bosi et al. when they studied MeDuSa . Basically, each testing dataset consists of a target draft genome to be scaffolded and two or more reference genomes that can be either complete or incomplete.
|Organism||No. of replicons||No. of contigs||No. of references||Genome size (Mbp)||GC%|
|B. cenocepacia j2315||4||1223||4||8.05||65.9|
|E. coli K12||1||451||25||4.64||50.8|
|R. sphaeroides 2.4.1||7||564||2||4.60||67.4|
4.2 Evaluation metrics
For each testing dataset, Bosi et al.  also provided a reference order for the contigs of the target genome that can be used a truth standard to evaluate the multiple reference-based scaffolders. The evaluation metrics of the scaffolders include sensitivity, precision, , genome coverage, NGA50, scaffold number and running time. Basically, sensitivity, precision and are used to estimate the scaffold accuracy, genome coverage to estimate the scaffold coverage, and NGA50 and scaffold number to estimate the scaffold contiguity. Below, we introduced their detailed definitions.
Note that if any two contigs in a scaffold appear in continuous order and correct orientation in the reference order, then they are viewed as a correct join. Let denote the result obtained by applying a scaffolder on a target genome and denote the number of all contig joins in the reference order. The number of the correct contig joins in is then called as true positive () and the number of the others (i.e., incorrect joins) as false positive (). In addition, the sensitivity of is defined as , its precision as , and its as . Actually, is a balanced measure between sensitivity and precision and is high only when both sensitivity and precision are high.
Suppose that the target genome contains only circular DNAs and is a contig in . If the both sides of are joined correctly with two contigs, then the whole length of will be counted in the genome coverage that will be defined later. If exactly one side of is joined correctly with one contig, then half of the whole length of will be counted. If the both sides of are joined incorrectly with two contigs, then the whole length of will be ignored. Based on the above discussion, the genome coverage of is defined to be the ratio of the sum of the contig lengths counted according to the above-mentioned rules to the sum of all contig lengths. On the other hands, suppose that there are linear DNAs in the target genome . Then in the reference order of each linear DNA, the first and last contigs have just one neighbor contig and thus only half of their lengths will be counted in the calculation of the genome coverage if these two contigs are correctly joined with neighbor contigs.
The NGA50 value of is computed as follows . First, the scaffolds of are aligned with the complete sequence of the target genome to find the mis-assembly breakpoints. Second, the scaffolds of are broken at the mis-assembly breakpoints and their unaligned regions are also removed. Finally, the NGA50 value is equal to the NG50 value of the resulting scaffolds, which is the size of the shortest scaffold with longer and equal length scaffolds covering at least 50% of the target genome.
4.3 Comparison of multiple reference-based scaffolding results
All the three evaluated scaffolders Multi-CSAR, Ragout (version 1.0) and MeDuSa (version 1.6) were all run with their default parameters, except that a star tree was used in Ragout to serve as the phylogenetic tree for each testing dataset because reliable phylogenetic trees were still unknown. Table 2 displays their average performance results over the five bacterial datasets, by showing the values of sensitivity (Sen), precision (Pre), and genome coverage (Cov) in percentage (%) and the size of NGA50 in base pairs (bp). In addition, Table 2 shows the numbers of scaffolds computed by all evaluated scaffolders in the column ‘#Scaf’ and their running times in minutes in the column ‘Time’. The best result in each column of Table 2 is shown in bold.
As shown in Table 2, Multi-CSAR running with NUCmer achieves the best sensitivity, , genome coverage, NGA50 and running time, and the second best precision and scaffold number. On the other hands, Multi-CSAR running with PROmer has the best result in terms of scaffold number and the second best results in terms of sensitivity, , genome coverage and NGA50. From the precision point of view, the performance of Ragout is the best among all the tested multiple reference-based scaffolders. However, the sensitivity of Ragout is substantially inferior to that of Multi-CSAR when either running with NUCmer or PROmer. This negative result also leads to that Ragout is much inferior to Multi-CSAR in the performance of . Moreover, Ragout yields the worst results in terms of both scaffold number and running time. Compared Multi-CSAR and Ragout, MeDuSa gives the worst performance in sensitivity, precision, , genome coverage and NGA50, although it has the second best performance in running time.
Table 3 shows the average performance results of Multi-CSAR on the five bacterial datasets when using the sequence identity-based weighting scheme, where the best performance in each column is also displayed in bold. As compared to the results of Multi-CSAR as shown in Table 2, several performance measures of Multi-CSAR can be further improved if it is run with the sequence identity-based weighting scheme of reference genomes, such as sensitivity, precision, , genome coverage and NGA50.
Scaffolders are useful tools for sequencing projects to obtain more complete sequences of genomes being sequenced. In this chapter, we mainly introduced some state-of-the-art multiple reference-based scaffolders, such as Ragout, MeDuSa and Multi-CSAR (improved extension of Multi-CAR), that can efficiently produce more accurate scaffolds of a target draft genome by referring to multiple complete and/or incomplete genomes of related organisms. By testing on five real prokaryotic datasets, Multi-CSAR outperforms Ragout and MeDuSa in terms of average sensitivity, precision, , genome coverage, NGA50, scaffold number and running time. Currently, Multi-CSAR provides the users with a web interface that is intuitive and easy to operate. In addition, it displays its scaffolding result in a graphical mode that allows the users to visually validate the correctness of scaffolded contigs and in a tabular mode that allows the users to view the details of scaffolds.
This work was partially supported by Ministry of Science and Technology of Taiwan under grants MOST107-2221-E-007-066-MY2 and MOST109-2221-E-007-086.
Conflict of interest
The authors declare no conflict of interest.
- The web server of Multi-CSAR is available at http://genome.cs.nthu.edu.tw/Multi-CSAR/.