The list of tools for different types of SV calling.
Genomic structural variations (SVs) are genetic alterations that result in duplications, insertions, deletions, inversions, and translocations of segments of DNA covering 50 or more base pairs. By changing the organization of DNA, SVs can contribute to phenotypic variation or cause pathological consequences such as neurobehavioral disorders, autoimmune diseases, obesity, and cancers. SVs were first examined using classic cytogenetic methods, revealing changes down to 3 Mb. Later techniques for SV detection were based on array comparative genome hybridization (aCGH) and single-nucleotide polymorphism (SNP) arrays. Next-generation sequencing (NGS) approaches enabled precise characterization of breakpoints of SVs of various types and sizes at a genome-wide scale. Dissecting SVs from NGS data presents a substantial challenge due to the relatively short sequence reads and the large volume of the data. Benign variants and reference errors in the genome add another dimension of complexity. Even though a wide range of tools is available, the use of SV callers in routine molecular diagnostics is still limited. SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity; therefore, the SV detection process usually utilizes multiple variant callers. This chapter summarizes the strengths and limitations of different tools for effective NGS SV calling.
- genome organization
- next-generation sequencing
- structural variation
- variant calling
Early efforts in exploring genetic variation focused on single-nucleotide polymorphisms (SNPs), which were initially considered the main source of genetic and phenotypic human variation, while larger variations were thought to be rare events. However, in 2004 two studies [2, 3] revealed an unexpectedly large amount of large-scale variation (several kb to hundreds of kb) in the human genome. Evidence for the prevalence of structural variants (SVs), such as deletions, duplications, and inversions, began to accumulate. By changing the organization of the DNA, SVs can contribute to the phenotypic differences among healthy individuals or cause severe phenotypic consequences. SVs are involved in a wide range of diseases and conditions, such as autism spectrum disorders [4, 5, 6], schizophrenia, Crohn’s disease, rheumatoid arthritis, lupus erythematosus, psoriasis, obesity, and cancers [13, 14]. Among the different classes of genetic variation, SVs have remained the most challenging to detect and characterize. SVs were first examined with classic cytogenetic methods, which identified chromosomal abnormalities and revealed changes down to 3 Mb. Later techniques for SV detection were based on array comparative genome hybridization (aCGH) and single-nucleotide polymorphism arrays. Next-generation sequencing (NGS) has enabled methods for precise definition of breakpoints of SVs of different sizes and types. Characterization of SVs from high-throughput sequencing data is a complex task due to the volume of the data and the short sequence reads.
2. Structural variations
Genomic structural variations (SVs) are genetic alterations that result in amplifications, losses, inversions, and translocations of segments of DNA greater than 50 bp. SVs are a normal part of genomic variation but can also cause disorders. Standard detection methods include chromosome banding, fluorescent in situ hybridization (FISH), and array comparative genome hybridization (aCGH). aCGH is very useful for detecting copy number variations (CNVs) but cannot detect copy-neutral SVs (inversions, balanced translocations). More recent methods employ NGS to identify SVs that are not detectable by cytogenetic methods.
Chromosomal rearrangements can occur within a single chromosome (intrachromosomal SVs) or can involve exchange of genomic DNA between chromosomes (interchromosomal SVs). Intrachromosomal SVs are a product of one or more double-strand breaks, which may result in deletions, inversions, and duplications. Deletions and duplications are copy number variations and are easily detected from NGS data (read coverage method), whereas inversions are copy number-neutral. An interchromosomal translocation is the exchange of genetic material between two non-homologous chromosomes. In a reciprocal translocation, two broken-off pieces of two non-homologous chromosomes are exchanged, usually producing two balanced derivative chromosomes. Unless breakpoints disrupt important developmental genes, balanced translocations do not affect phenotype. However, during gamete formation such chromosomes may segregate in an unbalanced manner, or unbalanced translocations may occur de novo, leading to monosomy and trisomy of different chromosome segments, which account for approximately 1% of developmental delay and intellectual disability cases in humans [17, 18, 19]. Robertsonian translocations are a type of SV resulting from chromosome breaks near the centromeric regions of two acrocentric chromosomes and their reciprocal exchange, which produces one large metacentric chromosome and one very small chromosome that is usually lost without phenotypic effect. When three or more chromosomal breakpoints are involved, we speak of complex chromosome rearrangements, which may result in a balanced or unbalanced state.
3. Next-generation sequencing
The first commercially available next-generation sequencing platform was released in 2005. The technology has been continuously upgraded and has fundamentally changed the field of genetic studies. Next-generation sequencing (NGS), also known as high-throughput sequencing, parallelizes the sequencing process and produces millions of short reads (50–400 bp each) in a single experimental run. It has contributed to rapid progress in the detection of single-nucleotide polymorphisms. Due to the nature of short NGS reads, the category of longer variants has remained poorly characterized. Variants in the range of 10–100 kb are too small for detection by cytogenetic methods but too large for reliable detection with short-read sequencing. SVs affect more bases than single-nucleotide polymorphisms and represent an important class of genetic variation. Moreover, many SVs have been shown to play relevant roles in phenotypic variability and disease.
3.1. NGS data analysis pipeline
Once the samples are sequenced, the analysis of the NGS data becomes a bioinformatics task. The computational analysis and interpretation of the generated data remains one of the major bottlenecks in NGS projects. The basic steps for analyzing NGS data are quality assessment, read alignment (mapping) to a reference sequence, and variant identification. The second stage of analysis comprises variant analysis, visualization, and interpretation of the variants in relation to phenotypes. Commercial packages such as CLCBio Genomic Workbench, CASAVA, and SeqNext often provide all-in-one solutions, while academic pipelines typically consist of sequential tools for specific steps in the analysis.
The output of the sequencing machines is a set of reads, which are usually stored in text-based FASTQ files. The data obtained from NGS are compromised by sequence artifacts, including read errors, poor-quality reads, and primer contamination. To avoid erroneous conclusions, the artifacts should be removed. A number of bioinformatics tools are available for sequencing quality assessment, such as FastQC, FASTX-Toolkit, PRINSEQ, TagDust, and NGS QC Toolkit. The next step in NGS data analysis is the alignment of short reads to their corresponding positions on a reference sequence. A variety of algorithms have been developed for this task. Representative read mappers are Bowtie2, BWA, and Novoalign. The typical output of a read mapper is a BAM file, which contains information about the qualities and positions of the aligned sequences. Variant analysis consists of genotyping, variant calling, annotation, and prioritization. Genomic variants, such as SNPs and short insertions and deletions, are identified by variant callers. Widely used tools for variant calling are Genome Analysis Tool Kit HaplotypeCaller (GATK-HC), Samtools mpileup, Freebayes, and Torrent Variant Caller (TVC). Variant callers take in a BAM file and return a list of variants. To annotate variants, SnpEff, VariantAnnotator from the GATK, and ANNOVAR are used. To systematically filter, evaluate, and prioritize thousands of variants, VAAST 2.0, VarSifter, KGGSeq, and the commercial software Ingenuity Variant Analysis are available.
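The quality-assessment step can be illustrated with a minimal sketch. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the mean-quality threshold of 20 are hypothetical choices for illustration and are not taken from any of the tools listed above.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(handle, min_mean_quality=20):
    """Keep reads whose mean base quality reaches the (hypothetical) threshold."""
    return [
        (name, sequence, quality)
        for name, sequence, quality in read_fastq(handle)
        if mean_phred(quality) >= min_mean_quality
    ]


# Two toy reads: 'I' encodes Phred 40, '!' encodes Phred 0.
example = StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
kept = filter_reads(example)
print([name for name, _, _ in kept])
```

Dedicated quality-control tools go further than this, for example by trimming low-quality read ends and removing adapter or primer sequence rather than discarding whole reads.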
3.2. Single-read and paired-end sequencing
Initially, NGS technologies produced extremely short reads (25–36 bp), sequenced from only one end of the DNA fragment (single-read sequencing). As the technology developed, read lengths steadily increased and sequencers were improved to sequence both ends of a fragment, with or without a non-sequenced stretch in between (paired-end sequencing). This not only doubles the number of reads but also improves accuracy and offers additional information for the detection of structural variants.
The reads obtained from paired-end sequencing (R1 and R2) come from the same fragment of DNA. The fragment is usually longer than the combined length of the reads (R1 + R2), so there is a gap between them (Figure 1). Although the sequence of the fragment between the reads is not known, the knowledge that R1 and R2 are separated by a known distance and have opposite orientations is useful.
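The geometry of a read pair can be written down as simple arithmetic. The sketch below assumes 1-based, inclusive coordinates for the outermost mapped bases of R1 and R2; the function names and the example numbers are illustrative only.

```python
def fragment_length(r1_start, r2_end):
    """Outer distance from the first base of R1 to the last base of R2."""
    return r2_end - r1_start + 1


def inner_gap(r1_start, r2_end, r1_length, r2_length):
    """Length of the non-sequenced stretch between the two reads.

    A negative value means that the two reads overlap.
    """
    return fragment_length(r1_start, r2_end) - (r1_length + r2_length)


# Two 100 bp reads sequenced from the ends of a 400 bp fragment.
print(fragment_length(1001, 1400))      # 400
print(inner_gap(1001, 1400, 100, 100))  # 200
```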
4. Overview of the structural variation detection algorithms
NGS technologies produce large volumes of sequence data at unprecedented speed and steadily decreasing cost. Consequently, computational tools for the analysis of massive amounts of genomic data are in demand. There is a growing awareness that structural variations contribute significantly to genotypic and phenotypic diversity. However, the accurate detection of structural variants from NGS data is a daunting task. A number of algorithms have been proposed to address the problem of calling structural variants from NGS data. SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity. The algorithms follow one or a combination of strategies, which can be classified into four categories: (1) read depth (RD), (2) paired-end (PE), (3) split reads (SR), and (4) de novo assembly (AS). The most suitable method for SV detection depends on the size and type of the variant as well as on the characteristics of the sequencing data. The SV detection process usually utilizes multiple variant callers.
4.1. Algorithms based on read depth
Read depth (RD) algorithms are able to identify CNVs. RD-based algorithms can accurately predict absolute copy numbers but are unable to detect copy-number neutral variants such as inversions and balanced translocations. The breakpoint identification resolution is low and depends on the sequence coverage.
RD algorithms divide the reference sequence into intervals and count the reads aligned within each of them. The read depth per interval should follow a normal distribution centered at the average read depth for the entire reference sequence. When the read depth of contiguous intervals differs significantly from the observed average, a CNV is detected (Figure 2). Deleted regions show reduced read depth when compared to the entire reference sequence (Figure 3).
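The interval-counting idea can be sketched as follows. This is a toy illustration under stated assumptions, not the method of any particular caller: read start positions are 0-based, windows have a fixed size, and a window is flagged when its z-score against the mean window depth exceeds a hypothetical cutoff of 2. The function names are invented for the example.

```python
from statistics import mean, pstdev


def window_depths(read_starts, reference_length, window):
    """Count reads per fixed-size window, assigning each read by its 0-based start."""
    n_windows = (reference_length + window - 1) // window
    counts = [0] * n_windows
    for position in read_starts:
        counts[position // window] += 1
    return counts


def call_cnv_segments(counts, window, z_cutoff=2.0):
    """Flag windows whose depth deviates from the mean and merge adjacent ones.

    Returns (start, end, state) tuples in 0-based, half-open coordinates.
    """
    mu = mean(counts)
    sigma = pstdev(counts)
    segments = []
    for index, count in enumerate(counts):
        z = (count - mu) / sigma if sigma > 0 else 0.0
        if z <= -z_cutoff:
            state = "loss"
        elif z >= z_cutoff:
            state = "gain"
        else:
            continue
        start, end = index * window, (index + 1) * window
        if segments and segments[-1][2] == state and segments[-1][1] == start:
            # Extend the previous segment when the flagged windows are contiguous.
            segments[-1] = (segments[-1][0], end, state)
        else:
            segments.append((start, end, state))
    return segments


# Ten 1 kb windows with roughly 100 reads each, except one depleted window.
example_counts = [100, 98, 102, 101, 20, 99, 100, 103, 97, 100]
print(call_cnv_segments(example_counts, window=1000))
```

Note that the variant windows themselves shift the mean and inflate the standard deviation in this sketch, and that the breakpoints can only be placed at window boundaries, which reflects the low breakpoint resolution mentioned above.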
4.2. Paired-end approaches
Paired-end sequencing data allow detection of many types of SVs. Paired-end (PE) SV calling approaches detect deviations from the expected library insert size (donor reads map at inconsistent distances). When a pair of reads does not overlap any SV, the distance between the reads matches the library insert size and the reads have the correct orientation (concordant pairs). When a read pair overlaps an SV, the mapping distance of the paired reads differs from the library insert size and their orientation may be inverted. Discordantly mapped read pairs can be (1) further apart than expected, (2) closer together than expected, (3) in inverted orientation, (4) in incorrect order, or (5) on different chromosomes. Clusters of read pairs aligned to the same genomic region at a distance shorter than expected can be explained by an insertion in the sequenced sample (donor). Distances between reads that are larger than expected can be explained by a deletion in the sample (donor) (Figure 4). The resolution of the breakpoints detected by this approach depends on the library’s insert size and on the read coverage. Insertions larger than the library insert size cannot be detected.
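The five discordance classes can be expressed as a small decision procedure. The sketch below makes several assumptions that are not part of the text: a forward-reverse library (the left mate on the "+" strand, the right mate on the "-" strand), positions given as the outer coordinate of each mate, and a tolerance of three standard deviations around the mean insert size. The `ReadPair` class and `classify_pair` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int      # outer coordinate of mate 1
    strand1: str   # "+" or "-"
    chrom2: str
    pos2: int      # outer coordinate of mate 2
    strand2: str


def classify_pair(pair, mean_insert, sd_insert, n_sd=3):
    """Classify a read pair as concordant or as one of five discordant classes."""
    if pair.chrom1 != pair.chrom2:
        return "different_chromosomes"
    if pair.strand1 == pair.strand2:
        return "inverted_orientation"
    # Order the mates by position; a forward-reverse library expects "+" then "-".
    (left_pos, left_strand), (right_pos, _) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if left_strand == "-":
        return "incorrect_order"
    distance = right_pos - left_pos
    if distance > mean_insert + n_sd * sd_insert:
        return "too_far_apart"  # consistent with a deletion in the donor
    if distance < mean_insert - n_sd * sd_insert:
        return "too_close"      # consistent with an insertion in the donor
    return "concordant"


pair = ReadPair("chr1", 1000, "+", "chr1", 3000, "-")
print(classify_pair(pair, mean_insert=400, sd_insert=30))
```

A single discordant pair is weak evidence; as described above, callers look for clusters of pairs that support the same event.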
4.3. Algorithms based on split-reads
Split-read (SR) algorithms can detect SVs with single base-pair resolution. Split reads contain the breakpoint of the structural variant, so their alignments to the reference genome are split into two parts (Figure 5). The parts of a read are aligned to the reference genome independently, so each part must be long enough to be aligned uniquely. Therefore, algorithms based on split reads are feasible only when the sequencing reads are sufficiently long.
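A toy example shows how a split read pins down a deletion to the base pair. It assumes error-free reads, exact string matching, and unique anchors, none of which hold for real data; the function name, the `min_anchor` parameter, and the sequences are invented for illustration.

```python
def split_read_deletion(reference, read, min_anchor=8):
    """Locate a deletion spanned by a read using exact matching of its two parts.

    Returns (deletion_start, deletion_end) in 0-based, half-open reference
    coordinates, or None when the read aligns contiguously or cannot be anchored.
    """
    start = reference.find(read[:min_anchor])
    if start == -1:
        return None
    # Extend the left part of the read for as long as it matches the reference.
    matched = 0
    while (
        matched < len(read)
        and start + matched < len(reference)
        and reference[start + matched] == read[matched]
    ):
        matched += 1
    if matched == len(read):
        return None  # the whole read aligns in one piece
    right_part = read[matched:]
    if len(right_part) < min_anchor:
        return None  # remaining part is too short to be placed reliably
    right_start = reference.find(right_part, start + matched)
    if right_start == -1:
        return None
    return (start + matched, right_start)


left_flank = "GATTACAGGCTTCA"
deleted = "TTTTTGGGGGCCCCC"
right_flank = "AGTCCGATAGCTAA"
reference = "CCATG" + left_flank + deleted + right_flank + "GGTAC"
# A donor read that crosses the deletion: end of the left flank joined to the right flank.
read = left_flank[-10:] + right_flank[:10]
print(split_read_deletion(reference, read))
```

The `min_anchor` check mirrors the requirement above that each part of the read must be long enough to be placed uniquely.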
4.4. Algorithms based on de novo assembly
Algorithms based on de novo assembly (AS) are able to detect all forms of structural variation. De novo assembly refers to reassembling the original sequence from which the fragments were sampled. Once the sequenced genome is assembled, it is compared to the reference genome to identify SVs. The method enables discovery of novel sequence fragments (insertions). The approach is time-consuming, costly, and prone to assembly errors. In terms of computational efficiency and detection power, targeted SV assembly is more effective. Targeted approaches divide the problem into a set of local assembly problems that can be solved more efficiently.
4.5. Hybrid-approaches for SV calling
SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity. No single method can detect the complete range of SVs; each is limited to specific types of SVs. Combined approaches can overcome the limitations of a single method. Two directions can be taken: combining strategies within one caller or combining several SV callers. Another class of SV detection methods is based on machine learning. Variations are identified by various methods and are filtered against empirically derived training set data.
4.6. Bioinformatics tools for structural variation calling
A number of algorithms have been proposed to address the problem of calling structural variants from NGS data, but structural variation calling remains challenging. The complete range of SVs cannot be discovered using one single method. The process of SV calling usually utilizes multiple variant callers to overcome the limitations of individual approaches. Knowing the advantages and drawbacks of the various tools (Table 1) is important for making proper decisions when designing NGS data analysis pipelines. Different callers yield lists of identified SVs with limited overlap. Several published pipelines, namely SVMerge, HugeSeq, iSVP, and intansv, integrate different SV callers, such as BreakDancer, CNVnator, SVseq2, Pindel, and DELLY, and merge their results.
|Tool|SV types detected|Detection strategy|Published|
|---|---|---|---|
|PEMer|Indels, inversions|paired-end|2009 Feb|
|BreakDancer|Indels, inversions, and translocations|paired-end|2009 Jul|
|Pindel|Breakpoints of large deletions and medium-sized insertions|split-read|2009 Nov|
|VariationHunter|Transposon insertions|paired-end|2010 Jun|
|Cortex|Simple and complex SVs|de novo assembly|2011 Apr|
|GASVPro|Indels, inversions, interchromosomal translocations|read-depth, paired-end|2012 Mar|
|SVseq2|Indels with exact breakpoints|split-read, paired-end|2012 Apr|
|Breakpointer|Indels, mobile insertions, and non-homologous recombinations|read-depth, split-read|2012 Apr|
|DELLY|Copy-number variable deletions, tandem duplications, inversions, reciprocal translocations|split-read, paired-end|2012 Sep|
|SVM2|Short insertions and deletions|paired-end, machine learning|2012 Oct|
|PeSV-Fisher|Deletions, gains, intra- and interchromosomal translocations, and inversions|paired-end, read-depth|2013 May|
|LUMPY|Deletions, inversions, tandem duplications, and interchromosomal translocations|split-read, paired-end|2014 Jun|
|Gustaf|Deletions, inversions, dispersed duplications, and translocations of ≥30 bp|split-read|2014 Dec|
|MetaSV|Indels, insertions, inversions, translocations, and CNVs|integration of SV callers (BreakSeq, BreakDancer, Pindel, CNVnator), local assembly|2015 Aug|
|Manta|Medium-sized indels, large insertions|split-read, paired-end|2016 Apr|
|SRBreak|CNV breakpoints|read-depth, split-read|2016 Sep|
|Seeksv|Deletions, insertions, inversions, and interchromosomal transfers|split-read, paired-end, read-depth, fragments with two ends unmapped|2017 Jan|
|SVachra|Large insertions-deletions, inversions, inter- and intrachromosomal translocations|paired-end|2017 Oct|
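Because different callers report overlapping but not identical calls, integrating their output requires a rule for deciding when two calls describe the same event. One common choice is reciprocal overlap. The sketch below is an illustration under that assumption and does not reproduce the merging logic of the pipelines named above; the 50% threshold, the two-caller minimum, and the caller names are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two calls given as (chrom, start, end, sv_type).

    Coordinates are 0-based and half-open. Calls on different chromosomes or
    of different types never match.
    """
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def consensus_calls(calls_by_caller, min_overlap=0.5, min_callers=2):
    """Keep one representative per event supported by at least min_callers callers."""
    consensus = []
    for calls in calls_by_caller.values():
        for call in calls:
            # Skip calls that match an event that has already been accepted.
            if any(reciprocal_overlap(call, kept) >= min_overlap for kept, _ in consensus):
                continue
            supporters = sorted(
                name
                for name, others in calls_by_caller.items()
                if any(reciprocal_overlap(call, other) >= min_overlap for other in others)
            )
            if len(supporters) >= min_callers:
                consensus.append((call, supporters))
    return consensus


calls = {
    "caller_a": [("chr1", 1000, 2000, "DEL"), ("chr2", 500, 900, "INV")],
    "caller_b": [("chr1", 1100, 2050, "DEL")],
    "caller_c": [("chr1", 5000, 5200, "DUP")],
}
for call, supporters in consensus_calls(calls):
    print(call, supporters)
```

Requiring agreement between callers trades sensitivity for precision: events that only one strategy can detect, such as copy-neutral inversions missed by read-depth callers, are dropped unless the rule is relaxed for them.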
Next-generation sequencing technologies produce large volumes of sequence data at unprecedented speed and steadily decreasing cost. This has allowed rapid progress in the detection of single-nucleotide polymorphisms. Awareness that structural variations represent a significant source of genotypic and phenotypic variation is steadily growing. However, the accurate detection of structural variants from NGS data is a daunting task. Relatively short reads, the often repetitive character of SVs, the large amount of data, and the large number of benign variants in complex genomes represent major challenges for the bioinformatics analysis of SVs. A number of algorithms have been proposed to address the problem of calling structural variants from NGS data. SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity, so the SV detection process usually utilizes multiple variant callers. Knowing the advantages, drawbacks, and properties of the different tools is therefore essential for making proper decisions when designing NGS data analysis pipelines from publicly available tools. This chapter summarizes the basic concepts of the bioinformatics analysis of SVs and introduces some rules for their assessment.
The authors acknowledge the financial support from the Slovenian Research Agency (research core funding P4-0220).
Conflict of interest
We have no conflict of interest to declare.