Open access peer-reviewed chapter

Mining for Structural Variations in Next-Generation Sequencing Data

By Minja Zorc, Jernej Ogorevc and Peter Dovč

Submitted: January 20th 2018; Reviewed: March 17th 2018; Published: June 20th 2018

DOI: 10.5772/intechopen.76568

Abstract

Genomic structural variations (SVs) are genetic alterations that result in duplications, insertions, deletions, inversions, and translocations of DNA segments covering 50 or more base pairs. By changing the organization of DNA, SVs can contribute to phenotypic variation or cause pathological consequences such as neurobehavioral disorders, autoimmune diseases, obesity, and cancers. SVs were first examined using classic cytogenetic methods, which revealed changes down to 3 Mb. Later techniques for SV detection were based on array comparative genome hybridization (aCGH) and single-nucleotide polymorphism (SNP) arrays. Next-generation sequencing (NGS) approaches enabled precise characterization of breakpoints of SVs of various types and sizes at a genome-wide scale. Dissecting SVs from NGS data presents a substantial challenge due to the relatively short sequence reads and the large volume of data. Benign variants and errors in the reference genome add another dimension of complexity. Even though a wide range of tools is available, the use of SV callers in routine molecular diagnostics is still limited. SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity; therefore, the SV detection process usually utilizes multiple variant callers. This chapter summarizes the strengths and limitations of different tools for effective NGS SV calling.

Keywords

  • bioinformatics
  • genome organization
  • next-generation sequencing
  • structural variation
  • variant calling

1. Introduction

The first efforts in exploring genetic variation focused on single-nucleotide polymorphisms (SNPs), which were initially considered the main source of genetic and phenotypic human variation [1], while larger variations were thought to be rare events. However, in 2004 two studies [2, 3] revealed an unexpectedly large number of large-scale variations (several kb to hundreds of kb) in the human genome. Evidence for the prevalence of structural variants (SVs), such as deletions, duplications, and inversions, began to accumulate. By changing the organization of the DNA, SVs can contribute to phenotypic differences among healthy individuals or cause severe phenotypic consequences. SVs are involved in a wide range of diseases and conditions, such as autism spectrum disorders [4, 5, 6], schizophrenia [7], Crohn’s disease [8], rheumatoid arthritis [9], lupus erythematosus [10], psoriasis [11], obesity [12], and cancers [13, 14]. Among the different classes of genetic variation, SVs have remained the most challenging to detect and characterize. SVs have been examined since the first identification of chromosomal abnormalities with classic cytogenetic methods, which revealed changes down to 3 Mb. Later techniques for SV detection were based on array comparative genome hybridization (aCGH) and single-nucleotide polymorphism arrays. Next-generation sequencing (NGS) has enabled methods for precise definition of breakpoints of SVs of different sizes and types. Characterization of SVs from high-throughput sequencing data is a complex task due to the volume of the data and the short sequence reads.

2. Structural variations

Genomic structural variations (SVs) are genetic alterations that result in amplifications, losses, inversions, and translocations of segments of DNA greater than 50 bp. SVs are a normal part of genomic variation but can also cause disorders. Standard detection methods include chromosome banding, fluorescent in situ hybridization (FISH), and array comparative genome hybridization (aCGH). The latter is very useful for detecting copy number variations (CNVs) but cannot detect copy-neutral SVs (inversions, balanced translocations) [15]. More recent methods employ NGS to identify SVs that are not detectable by cytogenetic methods.

Chromosomal rearrangements can occur within a single chromosome (intrachromosomal SVs) or can involve exchange of genomic DNA between chromosomes (interchromosomal SVs). Intrachromosomal SVs are a product of one or more double-strand breaks, which may result in deletions, inversions, and duplications. Deletions and duplications are copy number variations and are readily detected in NGS data (read coverage method), whereas inversions are copy number-neutral. An interchromosomal translocation is the exchange of genetic material between two non-homologous chromosomes. In a reciprocal translocation, two broken-off pieces of two non-homologous chromosomes are exchanged, usually producing two balanced derivative chromosomes. Unless breakpoints disrupt important developmental genes, balanced translocations do not affect phenotype [15]. However, during gamete formation such chromosomes may segregate in an unbalanced manner, or unbalanced translocations may occur de novo and lead to monosomy and trisomy of different chromosome segments [16], which account for approximately 1% of developmental delay and intellectual disability cases in humans [17, 18, 19]. Robertsonian translocations are a type of SV resulting from breaks near the centromeric regions of two acrocentric chromosomes and their reciprocal exchange, which produces one large metacentric chromosome and one very small chromosome that is usually lost without phenotypic effect. When three or more chromosomal breakpoints are involved, we speak of complex chromosome rearrangements, which may result in a balanced or an unbalanced state [20].

3. Next-generation sequencing

The first commercially available next-generation sequencing platform was released in 2005 [21]. The technology has been continuously upgraded and has fundamentally changed the field of genetic studies. Next-generation sequencing (NGS), also known as high-throughput sequencing, parallelizes the sequencing process and produces millions of short reads (50–400 bp each) in a single experimental run. It has contributed to rapid progress in the detection of single-nucleotide polymorphisms. Due to the short-read nature of NGS data, the category of longer variants has remained poorly characterized. Variants in the range of 10–100 kb are too small for detection by cytogenetic methods [22] but too large for reliable detection with short-read sequencing. SVs affect more bases than single-nucleotide polymorphisms [23] and represent an important class of genetic variation. Moreover, many SVs have been shown to play relevant roles in phenotypic variability and disease [24].

3.1. NGS data analysis pipeline

Once the samples are sequenced, NGS data analysis becomes a bioinformatics task. The computational analysis and interpretation of the generated data remain one of the major bottlenecks in NGS projects. The basic steps for analyzing NGS data are quality assessment, read alignment (mapping) to a reference sequence, and variant identification. The second stage of analysis comprises variant analysis, visualization, and interpretation of the variants in relation to phenotypes. Commercial packages such as CLCBio Genomic Workbench, CASAVA, and SeqNext often provide all-in-one solutions, while academic pipelines typically consist of sequential tools for specific steps in the analysis.

The output from the sequencing machines consists of reads, which are usually stored in text-based FASTQ files. The data obtained from NGS are compromised by sequence artifacts, including read errors, poor-quality reads, and primer contamination [25]. To avoid erroneous conclusions, the artifacts should be removed. A number of bioinformatics tools have been developed for sequencing quality assessment, such as FastQC, FASTX-Toolkit, PRINSEQ [26], TagDust [27], and NGS QC Toolkit [28]. The next step in NGS data analysis is the alignment of short reads to their corresponding positions on a reference sequence. A variety of algorithms have been developed for this task. Representative read mappers are Bowtie2 [29], BWA [30], and Novoalign. The typical output from a read mapper is a BAM file, which contains information about the qualities and positions of the aligned sequences. Variant analysis consists of genotyping, variant calling, annotation, and prioritization. Genomic variants, such as SNPs and short insertions and deletions, are identified by variant callers. Widely used tools for variant calling are the Genome Analysis Toolkit HaplotypeCaller (GATK-HC) [14], Samtools mpileup [31], Freebayes, and Torrent Variant Caller (TVC). Variant callers take a BAM file as input and return a list of variants. To annotate variants, SnpEff [32], VariantAnnotator from the GATK [33], and ANNOVAR [34] are used. To systematically filter, evaluate, and prioritize thousands of variants, VAAST 2.0 [35], VarSifter [36], KGGseq [37], and the commercial software Ingenuity Variant Analysis are available.
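To make the quality-assessment step more concrete, the following is a minimal Python sketch of a mean-quality read filter over FASTQ records. It is an illustration only, not how FastQC, PRINSEQ, or any other tool named above works internally. The function names, the Phred+33 quality encoding, and the threshold of 20 are assumptions chosen for the example.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (line.strip() for line in lines[i:i + 4])
        yield header[1:], seq, qual  # drop the leading '@' of the header


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a read; Phred+33 encoding is assumed here."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(lines: List[str], min_mean_q: float = 20.0) -> List[Tuple[str, str]]:
    """Keep reads whose mean base quality reaches the (hypothetical) threshold."""
    return [
        (read_id, seq)
        for read_id, seq, qual in parse_fastq(lines)
        if qual and mean_phred(qual) >= min_mean_q
    ]


if __name__ == "__main__":
    records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "!!!!"]
    print(filter_reads(records))
```

In practice, quality control also covers adapter and primer contamination, which this sketch does not attempt.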

3.2. Single-read and paired-end sequencing

Initially, NGS technologies produced extremely short reads (25–36 bp), sequenced from only one end of the DNA fragment (single-read sequencing) [38]. As the technology developed, read lengths steadily increased and sequencers were improved to sequence both ends of a fragment, with or without a non-sequenced stretch in between (paired-end sequencing). This not only has the benefit of doubling the number of reads but also improves accuracy and offers additional information for the detection of structural variants.

The reads obtained from paired-end sequencing (R1 and R2) come from the same fragment of DNA. The fragment is usually longer than the combined length of the reads (R1 + R2), so there is a gap between them (Figure 1). Although the sequence of the fragment between the reads is not known, the knowledge that R1 and R2 lie at a known distance from each other and have opposite orientations is useful.

Figure 1.

Paired-end sequencing; the inner distance between paired reads (R1 and R2) is known.
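The relationship between fragment length, read length, and inner distance can be written down directly. The sketch below assumes 0-based leftmost mapping coordinates, with R1 on the forward strand and R2 on the reverse strand; the class and function names are hypothetical and do not correspond to any particular tool.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple


@dataclass
class ReadPair:
    r1_start: int  # leftmost reference position of R1 (0-based, assumed)
    r1_len: int    # length of R1 in bases
    r2_start: int  # leftmost reference position of R2 (0-based, assumed)
    r2_len: int    # length of R2 in bases


def fragment_length(pair: ReadPair) -> int:
    """Distance from the first base of R1 to the last base of R2."""
    return pair.r2_start + pair.r2_len - pair.r1_start


def inner_distance(pair: ReadPair) -> int:
    """Length of the unsequenced gap between R1 and R2."""
    return pair.r2_start - (pair.r1_start + pair.r1_len)


def library_stats(pairs: List[ReadPair]) -> Tuple[float, float]:
    """Mean and standard deviation of fragment length across a library."""
    lengths = [fragment_length(p) for p in pairs]
    return mean(lengths), pstdev(lengths)


if __name__ == "__main__":
    pair = ReadPair(r1_start=1000, r1_len=100, r2_start=1300, r2_len=100)
    print(fragment_length(pair), inner_distance(pair))
```

The library mean and standard deviation estimated this way are the reference against which paired-end SV callers judge whether a mapped pair is unusually far apart or close together.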

4. Overview of the structural variation detection algorithms

NGS technologies produce large volumes of sequence data at an unprecedented speed and steadily decreasing cost. Consequently, computational tools for the analysis of massive amounts of genomic data are in demand. There is a growing awareness that structural variations contribute significantly to genotypic and phenotypic diversity [39]. However, the accurate detection of structural variants from NGS data is a daunting task [40]. A number of algorithms have been proposed to address the issue of structural variant calling from NGS data [41]. SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity. The algorithms follow one or a combination of strategies, which can be classified into four categories: (1) read depth (RD), (2) paired-end (PE), (3) split reads (SR), and (4) de novo assembly (AS). The most suitable method for SV detection depends on the size and type of the variant as well as on the characteristics of the sequencing data [42]. The SV detection process usually utilizes multiple variant callers.

4.1. Algorithms based on read depth

Read depth (RD) algorithms are able to identify CNVs. RD-based algorithms can accurately predict absolute copy-numbers [43] but are unable to detect copy-number neutral variants such as inversions and balanced translocations. The breakpoint identification resolution is low and depends on the sequence coverage.

RD algorithms divide the reference sequence into intervals and count the number of reads aligned within each of them. The read depth per interval is expected to follow a normal distribution centered at the average read depth for the entire reference sequence. When the read depth of contiguous intervals differs significantly from the observed average, a CNV is called (Figure 2). Deleted regions show reduced read depth compared to the entire reference sequence (Figure 3).

Figure 2.

An example of a CNV that includes the KIT gene with its flanking regions in four pig genomes. The read coverage is higher in the region of the CNV. The figure was made using Golden Helix GenomeBrowse.

Figure 3.

An example of a deletion within the upstream and downstream regions of the LEPR locus in five pig genomes. The read coverage is low in the region of the deletion. The figure was made using Golden Helix GenomeBrowse.
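The read-depth strategy described in this section can be sketched in a few lines of Python: reads are counted per fixed-size interval, each count is converted to a z-score against the genome-wide mean, and runs of contiguous outlier intervals are reported. This is a simplified illustration under stated assumptions, not the method of any tool listed in this chapter; the function names, the z-score cutoff of 2, and the minimum run of two intervals are hypothetical, and real callers additionally correct for GC content and mappability.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def bin_read_counts(read_starts: List[int], ref_len: int, bin_size: int) -> List[int]:
    """Count reads per fixed-size interval, using each read's start position."""
    n_bins = (ref_len + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start in read_starts:
        if 0 <= start < ref_len:
            counts[start // bin_size] += 1
    return counts


def call_cnv_segments(
    counts: List[int], z_cutoff: float = 2.0, min_bins: int = 2
) -> List[Tuple[int, int, str]]:
    """Return (first_bin, end_bin_exclusive, 'gain' | 'loss') for outlier runs."""
    if not counts:
        return []
    mu, sd = mean(counts), pstdev(counts)
    if sd == 0:
        return []  # perfectly uniform depth: nothing to report

    # Label each interval by its z-score relative to the overall mean depth.
    labels: List[Optional[str]] = []
    for c in counts:
        z = (c - mu) / sd
        labels.append("gain" if z >= z_cutoff else "loss" if z <= -z_cutoff else None)

    # Merge contiguous intervals that carry the same label into segments.
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] is not None and i - start >= min_bins:
                segments.append((start, i, labels[start]))
            start = i
    return segments


if __name__ == "__main__":
    depth = [10] * 8 + [0, 0] + [10] * 10  # two intervals without coverage
    print(call_cnv_segments(depth))
```

Because the mean and standard deviation are computed from all intervals, including the variant ones, large or numerous CNVs would bias the baseline in this sketch; a median-based baseline is one common alternative.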

4.2. Paired-end approaches

Paired-end sequencing data allow detection of many types of SVs. Paired-end (PE) SV calling approaches detect deviations from the expected library insert size (read pairs from the donor map at inconsistent distances). When a pair of reads does not overlap any SV, the distance between them matches the size of the library insert and the reads have the correct orientation (concordant pairs). When the read pair overlaps an SV, the mapping distance of the paired reads differs from the library insert size and their orientation may be inverted. Discordantly mapped read pairs can be (1) further apart than expected, (2) closer together than expected, (3) in inverted orientation, (4) in incorrect order, or (5) on different chromosomes. Clusters of read pairs aligned to the same genomic region at a shorter distance than expected can be explained by an insertion in the sequenced sample (donor). Larger distances between reads than expected can be explained by a deletion in the sample (donor) (Figure 4). The resolution of the breakpoints detected by this approach depends on the library insert size and on the read coverage. Insertions larger than the library insert size cannot be detected.

Figure 4.

Examples of identification of deletion, insertion, and inversion using the paired-end approach: (A) paired reads map further apart than expected (deletion), (B) paired reads map closer together than expected (insertion), (C) paired reads map in inverted orientation (inversion).
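The five discordance signatures listed in this section map naturally onto a small decision function. The sketch below is a hedged illustration, not the logic of BreakDancer or any other caller: the `MappedPair` fields, the label strings, and the tolerance of three standard deviations around the mean insert size are all assumptions made for the example. Real callers cluster many supporting pairs before reporting a variant.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    chrom1: str
    pos1: int      # leftmost mapped position of mate 1
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int      # leftmost mapped position of mate 2
    strand2: str   # '+' or '-'
    read_len: int  # both mates are assumed to have the same length


def classify_pair(pair: MappedPair, mean_insert: float, sd_insert: float,
                  n_sd: float = 3.0) -> str:
    """Assign one read pair to a concordant or discordant category."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"

    # Order the mates by reference position.
    (_, left_strand), (_, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    left_pos = min(pair.pos1, pair.pos2)
    right_pos = max(pair.pos1, pair.pos2)

    if left_strand == right_strand:
        return "inversion"        # mates point in the same direction
    if left_strand == "-" and right_strand == "+":
        return "incorrect_order"  # mates point away from each other

    span = right_pos + pair.read_len - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"         # further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"        # closer together than expected
    return "concordant"


if __name__ == "__main__":
    pair = MappedPair("chr1", 1000, "+", "chr1", 3300, "-", 100)
    print(classify_pair(pair, mean_insert=400, sd_insert=20))
```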

4.3. Algorithms based on split-reads

Split-read (SR) algorithms can detect SVs with single base-pair resolution. Split reads contain the breakpoint of the structural variant, so their alignments to the reference genome are split into two parts (Figure 5). The parts of a read are aligned to the reference genome independently, so each part must be long enough to be aligned uniquely. Therefore, algorithms based on split reads are feasible only when the sequencing reads are sufficiently long.

Figure 5.

An example of a deletion in an individual genome detected by the split-read method.
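A toy version of the split-read idea for deletions is sketched below. The read is cut at every candidate position, both parts are located in the reference by exact matching, and a deletion is reported when both parts map uniquely with a gap between them. Exact string matching stands in for a real aligner here, and the function names and the minimum anchor length of 8 bases are hypothetical choices; the uniqueness check mirrors the requirement that each part of the read be long enough to align unambiguously.

```python
from typing import Optional, Tuple


def unique_position(ref: str, segment: str) -> Optional[int]:
    """Return the position of `segment` in `ref` if it occurs exactly once."""
    first = ref.find(segment)
    if first == -1 or ref.find(segment, first + 1) != -1:
        return None
    return first


def find_deletion_by_split_read(
    ref: str, read: str, min_anchor: int = 8
) -> Optional[Tuple[int, int]]:
    """Return the deleted reference interval [start, end) implied by a split read.

    Returns None if the read maps contiguously or cannot be split uniquely.
    """
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left_pos = unique_position(ref, read[:k])
        right_pos = unique_position(ref, read[k:])
        if left_pos is None or right_pos is None:
            continue
        if right_pos > left_pos + k:
            # The two parts map with a gap between them: that gap is the deletion.
            return left_pos + k, right_pos
    return None


if __name__ == "__main__":
    left, deleted, right = "ACGTACCGTTAGGCAT", "TTTTTTTTTTGGGGGGGGGG", "CAGTCCATGACTGATC"
    reference = left + deleted + right
    donor_read = left[-12:] + right[:12]  # read spanning the deletion breakpoint
    print(find_deletion_by_split_read(reference, donor_read))
```

When the sequences flanking a breakpoint are similar (microhomology), several split positions can be valid; this sketch simply returns the first one it finds.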

4.4. Algorithms based on de novo assembly

Algorithms based on de novo assembly (AS) are able to detect all forms of structural variation. De novo assembly refers to reconstructing the original sequence from which the fragments were sampled. Once the sequenced genome is assembled, it is compared to the reference genome to identify SVs. The method enables the discovery of novel sequence fragments (insertions). However, the approach is time-consuming, costly, and prone to assembly errors. In terms of computational efficiency and detection power, targeted SV assembly is more effective: targeted methods break the problem into a set of local assembly problems that can be solved more efficiently.
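The comparison step that follows assembly can be illustrated with a deliberately small example. The sketch below does not perform assembly; it assumes a contig has already been assembled and covers the same locus as a reference segment, and it localizes a single difference by trimming the shared prefix and suffix. The function name and the returned tuple layout are hypothetical, and real pipelines use full alignment rather than this shortcut.

```python
from typing import Tuple


def compare_contig_to_reference(ref: str, contig: str) -> Tuple[str, int, str]:
    """Localize one simple difference between a contig and a reference segment.

    Returns (kind, position, sequence), where kind is 'identical', 'insertion',
    'deletion', or 'substitution', position is 0-based on the reference, and
    sequence is the inserted, deleted, or replacing bases.
    """
    limit = min(len(ref), len(contig))

    # Length of the shared prefix.
    p = 0
    while p < limit and ref[p] == contig[p]:
        p += 1

    # Length of the shared suffix, not allowed to overlap the prefix.
    s = 0
    while s < limit - p and ref[-1 - s] == contig[-1 - s]:
        s += 1

    ref_mid = ref[p:len(ref) - s]
    contig_mid = contig[p:len(contig) - s]

    if not ref_mid and not contig_mid:
        return "identical", p, ""
    if not ref_mid:
        return "insertion", p, contig_mid  # novel sequence in the donor
    if not contig_mid:
        return "deletion", p, ref_mid      # reference sequence missing in donor
    return "substitution", p, contig_mid


if __name__ == "__main__":
    reference = "ACGTACGTTTGA"
    contig = "ACGTAC" + "TTCCAA" + "GTTTGA"
    print(compare_contig_to_reference(reference, contig))
```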

4.5. Hybrid-approaches for SV calling

SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity. No single method can detect the complete range of SVs; each is limited to specific types of SVs. Combined approaches can overcome the limitations of a single method [44]. Two directions can be taken: combining strategies within one caller or combining several SV callers [45]. Another class of SV detection methods is based on machine learning. Variations are identified by various methods and are then filtered against empirically derived training data sets.
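One simple way to combine several SV callers is to keep only calls that are supported by more than one caller, matching calls by reciprocal overlap. The sketch below illustrates this idea; it is not the merging procedure of SVMerge, MetaSV, or any other published pipeline. The caller names, the tuple layout of a call, the 50% reciprocal-overlap threshold, and the requirement of two supporting callers are all assumptions.

```python
from typing import Dict, List, Tuple

# A call is (chromosome, start, end, sv_type); the layout is hypothetical.
Call = Tuple[str, int, int, str]


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smallest fraction of either call that is covered by the other."""
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0  # different chromosome or different SV type
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def consensus_calls(
    calls_by_caller: Dict[str, List[Call]], min_callers: int = 2, min_ro: float = 0.5
) -> List[Tuple[Call, List[str]]]:
    """Keep calls supported by at least `min_callers` callers, without repeats."""
    consensus: List[Tuple[Call, List[str]]] = []
    for caller, calls in calls_by_caller.items():
        for call in calls:
            supporters = {caller}
            for other, other_calls in calls_by_caller.items():
                if other != caller and any(
                    reciprocal_overlap(call, o) >= min_ro for o in other_calls
                ):
                    supporters.add(other)
            already_kept = any(
                reciprocal_overlap(call, kept) >= min_ro for kept, _ in consensus
            )
            if len(supporters) >= min_callers and not already_kept:
                consensus.append((call, sorted(supporters)))
    return consensus


if __name__ == "__main__":
    calls = {
        "callerA": [("chr1", 1000, 2000, "DEL"), ("chr2", 500, 900, "DUP")],
        "callerB": [("chr1", 1100, 2100, "DEL")],
        "callerC": [("chr1", 5000, 6000, "INV")],
    }
    for call, support in consensus_calls(calls):
        print(call, support)
```

Requiring agreement between callers trades sensitivity for precision: variants that only one strategy can detect, such as copy-neutral inversions missed by read-depth methods, would be dropped by a strict intersection, so the support threshold is a design choice that depends on the application.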

4.6. Bioinformatics tools for structural variation calling

A number of algorithms have been proposed to address the issue of structural variant calling from NGS data, but structural variation calling remains challenging. The complete range of SVs cannot be discovered using a single method. The process of SV calling usually utilizes multiple variant callers to overcome the limitations of individual approaches. Knowing the advantages and drawbacks of the various tools (Table 1) is important for making proper decisions when designing NGS data analysis pipelines. Different callers yield lists of identified SVs with limited overlap. Several published pipelines, including SVMerge [46], HugeSeq [47], iSVP [48], and IntanSV, integrate different SV callers, such as BreakDancer, CNVnator, SVseq2, Pindel, and DELLY, and merge their results.

| Tool | SV type | Strategy | Released | Reference |
| PEMer | Indels, inversions | paired-reads | 2009 Feb | [49] |
| VariationHunter | Transposon insertions | paired-reads | 2010 Jun | [50] |
| SegSeq | CNVs | read-depth | 2009 Jan | [51] |
| BreakDancer | Indels, inversions, and translocations | paired-reads | 2009 Jul | [52] |
| Pindel | Breakpoints of large deletions and medium-sized insertions | split-read | 2009 Nov | [53] |
| Cortex | Simple and complex SVs | de novo assembly | 2011 Apr | [54] |
| CNVnator | CNVs | read-depth | 2011 Jun | [55] |
| GASVPro | Indels, inversions, interchromosomal translocations | read-depth, paired-end | 2012 Mar | [56] |
| SVseq2 | Indels with exact breakpoints | split-read, paired-end | 2012 Apr | [57] |
| Breakpointer | Indels, mobile insertions and non-homologous recombinations | read-depth, split-read | 2012 Apr | [58] |
| DELLY | Copy-number variable deletions, tandem duplications, inversions, reciprocal translocations | split-read, paired-end | 2012 Sep | [59] |
| SVM2 | Short insertions and deletions | paired-end, machine learning | 2012 Oct | [60] |
| PeSV-Fisher | Deletions, gains, intra- and interchromosomal translocations, and inversions | paired-reads, read-depth | 2013 May | [61] |
| LUMPY | Deletions, inversions, tandem duplications, and interchromosomal translocations | split-read, paired-end | 2014 Jun | [62] |
| Gustaf | Deletions, inversions, dispersed duplications and translocations of ≥30 bp | split-read | 2014 Dec | [63] |
| MetaSV | Indels, insertions, inversions, translocations, and CNVs | integration of SV callers (BreakSeq, BreakDancer, Pindel, CNVnator), local assembly | 2015 Aug | [64] |
| Manta | Medium-sized indels, large insertions | split-read, paired-end | 2016 Apr | [65] |
| SRBreak | CNV breakpoints | read-depth, split-read | 2016 Sep | [66] |
| Seeksv | Deletion, insertion, inversion and interchromosomal transfer | split-read, paired-end, read-depth, fragments with two ends unmapped | 2017 Jan | [67] |
| SVachra | Large insertions-deletions, inversions, inter- and intrachromosomal translocations | paired-end | 2017 Oct | [68] |

Table 1.

The list of tools for different types of SV calling.

5. Conclusions

Next-generation sequencing technologies produce large volumes of sequence data at an unprecedented speed and steadily decreasing cost. This has allowed rapid progress in the detection of single-nucleotide polymorphisms. The awareness that structural variations represent a significant source of genotypic and phenotypic variation is steadily growing. However, the accurate detection of structural variants from NGS data is a daunting task. Relatively short reads, the often repetitive character of SVs, the large amount of data, and the large number of benign variants in complex genomes represent a major challenge for the bioinformatics analysis of SVs. A number of algorithms have been proposed to address the issue of structural variant calling from NGS data. SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity, so the SV detection process usually utilizes multiple variant callers. Knowing the advantages, drawbacks, and properties of the different tools is therefore essential for making proper decisions when designing NGS data analysis pipelines from publicly available tools. This chapter summarizes the basic concepts of the bioinformatics analysis of SVs and introduces some rules for their assessment.

Acknowledgments

The authors acknowledge the financial support from the Slovenian Research Agency (research core funding P4-0220).

Conflict of interest

We have no conflict of interest to declare.

How to cite and reference

Minja Zorc, Jernej Ogorevc and Peter Dovč (June 20th 2018). Mining for Structural Variations in Next-Generation Sequencing Data, Bioinformatics in the Era of Post Genomics and Big Data, Ibrokhim Y. Abdurakhmonov, IntechOpen, DOI: 10.5772/intechopen.76568. Available from:
