Aquatic models for human diseases
The most valuable application of next generation sequencing (NGS) technology is genome sequencing. The genomes of several aquatic models have been sequenced in the past few years because of their importance in genomics, developmental biology, toxicology, pathology, and cancer research. NGS technology has advanced greatly in read length and accuracy, which facilitates the sequencing process, but sequence assembly, especially for species with complicated genomes, remains the biggest challenge for bench-top scientists.
- aquatic models
Next generation sequencing (NGS) technology has been broadly used in biomedical research. The most valuable application of this technology is genome and transcriptome sequencing, which forms a bridge linking fundamental discoveries made in disease model systems to clinical application. Aquatic animal models are widely used in genomics, developmental biology, toxicology, pathology, and cancer research (for a recent review, see ). The genomes of several aquatic models have been sequenced using NGS technology over the past few years [2, 3]. NGS technology has been trending toward reduced cost with greater read length and accuracy. While this has facilitated the sequencing process, sequence assembly remains a significant challenge for bench-top scientists, especially for species with complicated genomes.
In this chapter, we will focus on the application of NGS in aquatic genome and transcriptome assemblies. Although our focus will be on the genome sequencing of aquatic models, the associated techniques, problems, concerns, and solutions can also be applied to genome sequencing of other model systems. Using
2. Aquatic animal models in biomedical research
In recent years, aquatic animal models have been widely used in human disease research. These model systems have proven useful for improving our understanding of disease pathology at the molecular and cellular levels and have facilitated the development of new diagnostic and therapeutic methods. A few examples of diseases modeled by aquatic models are summarized in Table 1.
An example of the use of an aquatic model for human disease research is the
Genomes of aquatic disease models serve as bridges that link phenotypic changes to genetic responses and allow physiological and pathophysiological discoveries from animal models to be applied to human disease research. The sequencing of model system genomes offers researchers great resources for biomedical research. Genome sequences allow researchers to (a) find sequence variation among genomes and transcriptomes of different species and populations; (b) compare genetic responses between different phenotypes, developmental stages, disease conditions, drug treatments, etc.; and (c) discover gene/gene and gene/environment interactions and use these findings to direct medical applications.
3.1. Next generation sequencing
The NGS technique produces millions of short sequences (typical read length of 125 bp), which represent many unconnected small pieces of a genome or transcriptome, in each flow cell of the sequencing platform per sequence run. With these short sequences, one may
It is beyond the scope of this chapter to examine all of the current and upcoming sequencing technologies, and thus we focus on the most common NGS platform that is currently being employed to establish genomic and transcriptomic resources in aquatic model systems.
The Illumina genome analyzer platform is currently the most widely used NGS system, accounting for over 70% of the NGS market. In Figure 1, we illustrate the basic steps of Illumina sequencing technology. The sequencing process starts with preparation of a library. The DNA (for genomic sequencing) or cDNA (for RNA sequencing) sample is sheared, usually by a physical, enzymatic, or chemical method, into short fragments of a predetermined size, and sequencing adaptors are then ligated to both ends of each fragment. The fragments are then loaded onto a flow cell. Oligonucleotides bound to the surface of the flow cell are complementary to the adaptors, so that the free end of each fragment attaches to the flow cell via base pairing. A PCR step converts the initial fragment to its complementary sequence, so that both the forward strand and the reverse strand of each fragment are bound to the surface of the flow cell (Figure 1). To amplify the signal, PCR is repeated for several rounds, resulting in a cluster of copies around the initial copy of a fragment. Cyclic sequencing of these fragment clusters is very similar to Sanger sequencing and utilizes a sequencing-by-synthesis process. One of two unique primers is attached to the free end of the bound fragments, and nucleotides, each carrying a different fluorescent reporter tag and a reversible terminator, are then flowed onto the flow cell. Since each nucleotide contains an elongation terminator, only a single nucleotide can be incorporated into each newly synthesized sequence per sequencing cycle. After nucleotide incorporation, laser sources excite the fluorescent reporter, and an optical sensor scans the entire flow cell to capture the colors that represent the newly added base in every cluster. This optical information is converted to a base call for each growing sequence.
At the end of each cycle, the terminator is removed and the next cycle continues until the desired sequence length is attained. In paired-end sequencing, after the forward strand sequence is attained, another sequence primer initiates the sequencing of the reverse strand of each fragment.
This massively parallel sequencing platform allows high throughput sequencing. Each flow cell contains 8 lanes, with each lane producing 250 million reads (i.e., up to 500 Gb/flow cell), with read lengths ranging from 35 bp to 250 bp (Illumina HiSeq-2500) or 300 bp (Illumina MiSeq). Each sequencing adaptor incorporates a unique barcode in the form of a short oligonucleotide. Thus, multiple samples from different sources can be pooled together in one lane, which greatly increases sequencing throughput.
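The throughput figures quoted above can be checked with a short calculation. The following is a minimal sketch that uses the nominal values from the text (8 lanes, 250 million reads per lane, 250 bp reads); the function names and the genome sizes in the pooling example are hypothetical and serve only as an illustration.

```python
# Nominal values quoted in the text; treat them as upper bounds.
LANES_PER_FLOW_CELL = 8
READS_PER_LANE = 250_000_000
READ_LENGTH_BP = 250


def flow_cell_yield_bases(lanes, reads_per_lane, read_length_bp):
    """Total number of bases produced by one flow cell."""
    return lanes * reads_per_lane * read_length_bp


def samples_per_lane(genome_size_bp, target_coverage, reads_per_lane, read_length_bp):
    """Number of barcoded samples that fit in one lane at the target coverage."""
    bases_needed_per_sample = genome_size_bp * target_coverage
    bases_per_lane = reads_per_lane * read_length_bp
    return bases_per_lane // bases_needed_per_sample


total_bases = flow_cell_yield_bases(LANES_PER_FLOW_CELL, READS_PER_LANE, READ_LENGTH_BP)
print(f"{total_bases / 1e9:.0f} Gb per flow cell")

# Hypothetical pooling scenarios: a 1 Gb genome and a 100 Mb genome at 60x coverage.
print(samples_per_lane(1_000_000_000, 60, READS_PER_LANE, READ_LENGTH_BP))
print(samples_per_lane(100_000_000, 60, READS_PER_LANE, READ_LENGTH_BP))
```

Under these assumptions, one lane holds a single 1 Gb genome at 60× coverage, whereas ten barcoded 100 Mb genomes can share a lane at the same depth.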
Before subsequent sequence assembly or reference sequence alignment, a quality control step is usually necessary to obtain sequences that best represent the biology being studied. A short-read sequencing result file contains two types of “contaminants” that can hinder sequence assembly and result in misrepresentation of the actual nucleotide sequence: adaptor sequence and low-quality base calls. For paired-end sequencing, the length of the DNA fragment between the two adaptor sequences is defined as the “insertion size.” When the desired sequencing length is longer than the insertion size, the read can contain adaptor sequence. This artificial sequence must be trimmed off so that it does not introduce significant errors into sequence assemblies. The other contaminant, the low-quality base call, has many sources, from equipment to sequencing glitches. The quality of a base call is defined as Phred quality score (
To retain the most usable, high-quality sequencing reads, the adaptor sequences are first clipped off, low-quality base calls at the ends of the reads are then trimmed off, and finally sequence reads are filtered out if they contain a certain percentage of base calls that are below a defined
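The three read-cleaning steps described above (adaptor clipping, trimming of low-quality ends, and filtering of reads with too many poor base calls) can be sketched as follows. This is an illustrative sketch, not the pipeline used by the authors: it assumes the standard Phred definition Q = -10 log10(P) with the common ASCII offset of 33, and the adaptor sequence, thresholds, and function names are assumptions chosen for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def clip_adaptor(seq, qual, adaptor, min_match=5):
    """Remove a full adaptor, or an adaptor prefix sitting at the 3' end."""
    idx = seq.find(adaptor)
    if idx == -1:
        # Look for the longest adaptor prefix (>= min_match) at the read end.
        for n in range(min(len(adaptor) - 1, len(seq)), min_match - 1, -1):
            if seq.endswith(adaptor[:n]):
                idx = len(seq) - n
                break
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq, qual, min_q=20):
    """Drop trailing bases whose Phred score is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(qual, min_q=20, max_low_fraction=0.2):
    """Keep a read only if few enough of its base calls fall below min_q."""
    scores = phred_scores(qual)
    if not scores:
        return False
    low = sum(1 for q in scores if q < min_q)
    return low / len(scores) <= max_low_fraction


# Illustrative read: 16 genomic bases followed by 8 bases of adaptor read-through.
ADAPTOR = "AGATCGGAAGAGC"  # example adaptor sequence, assumed for this sketch
seq = "ACGTACGTACGTTTGA" + "AGATCGGA"
qual = "I" * 13 + "#" * 3 + "I" * 8

seq, qual = clip_adaptor(seq, qual, ADAPTOR)
seq, qual = trim_low_quality_tail(seq, qual)
print(seq, passes_filter(qual))
```

The order of operations follows the text: clipping comes first so that high-quality adaptor bases at the read end do not shield the low-quality genomic bases in front of them from trimming.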
3.2. Sequence assembler algorithms
There are two major types of sequence assembly methods, Overlap-Layout-Consensus assembly and De Bruijn graph assembly. Current efficient and successful sequence assembly programs, including the ones employed for
A De Bruijn graph-based assembler begins the assembly process by breaking the sequencing reads into
In Figure 2, 4 short DNA fragments that were attained from a randomly sheared 21 nt genome are sequenced. The
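The general idea of De Bruijn graph assembly can be illustrated with a toy example. The sketch below is not the genome shown in Figure 2 and not the algorithm of any particular assembler: it uses a hypothetical 21 nt sequence, four overlapping error-free reads, and an assumed k-mer size of 5. Nodes are (k-1)-mers, each k-mer contributes one edge, and a contig is recovered by following edges for as long as the path is unambiguous.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_start(graph):
    """Pick a node that has no incoming edge (assumes one exists)."""
    successors = {node for targets in graph.values() for node in targets}
    return [node for node in graph if node not in successors][0]


def walk_unambiguous(graph, start):
    """Extend a contig while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop instead of looping through a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Hypothetical 21 nt "genome" and four overlapping reads sampled from it.
genome = "ATGGCGTGCAATCCGTTAGCA"
reads = [genome[0:8], genome[4:12], genome[8:16], genome[12:21]]

graph = build_de_bruijn(reads, k=5)
contig = walk_unambiguous(graph, find_start(graph))
print(contig == genome)
```

In this toy case every 4-mer occurs only once, so the walk never meets a branch. A repeated (k-1)-mer would give a node more than one successor, and the walk would stop there, which is how repeats fragment or collapse real assemblies.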
Taking ALLPATHS as an example, memory use is estimated to be roughly 1.7 bytes per read base, which equates to about 102 GB of RAM for a 1-Gb genome at 60× coverage. This level of RAM is readily available nowadays. Alternatively, the RAM requirement can be met by sharing memory across computer nodes, or by distributing the workload to different nodes within a computer cluster, which is normally accessible at most universities and research institutions. In addition, the development of cloud computing allows one to gain access to high-speed computer clusters in a pay-as-you-go manner, and there are several recently developed cloud-based sequence assemblers (summarized in Table 3).
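The memory estimate above follows directly from the number of read bases. A minimal sketch of the arithmetic is given below; the figure of 1.7 bytes per read base is the one quoted in the text, while the function name and the second example genome are hypothetical.

```python
def estimate_assembly_ram_bytes(genome_size_bp, coverage, bytes_per_read_base=1.7):
    """Rough RAM estimate: total read bases times bytes needed per read base."""
    total_read_bases = genome_size_bp * coverage
    return total_read_bases * bytes_per_read_base


# The case from the text: a 1 Gb genome sequenced to 60x coverage.
ram_bytes = estimate_assembly_ram_bytes(1_000_000_000, 60)
print(f"{ram_bytes / 1e9:.0f} GB")

# A hypothetical 700 Mb genome at the same coverage, for comparison.
print(f"{estimate_assembly_ram_bytes(700_000_000, 60) / 1e9:.1f} GB")
```

Because the estimate scales linearly with both genome size and coverage, halving the coverage halves the memory needed, at the cost of a less contiguous assembly.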
3.3. Xiphophorus genome sequencing and assembly
3.3.1. Biological sample
3.3.2. Genome sequencing and assembly
The Illumina HiSeq-2000 platform was chosen for
The first stage contig assembly was carried out by ALLPATHS using only the Illumina sequencing reads. This step generated contig-level assembly with N50 of 60 kb and 30 kb for
These contigs were further grouped into scaffolds using the
The construction of chromosomal level genome was accomplished by aligning
3.3.3. Genome annotation
To annotate the newly assembled
Transcript sequences and associated functional annotations can be transferred between closely related species. A modified gene annotation method, RATT, was applied using the
To compare to this RATT annotation transfer method,
Comparing these two methods of annotation to each other in perspective of transcriptome quality,
In conclusion, the
3.3.4. Transposable elements analysis
As found previously,
To annotate the TEs in
4. Problems and potential resolutions in genome assembly
4.1. Repetitive sequences in genome result in gaps of assembly
Several aquatic model genomes have been sequenced, assembled, and annotated for public use thanks to the activities of the aquatic model community. During the genome sequencing and assembly process for many of these model systems, several problems have been encountered. Specific sequence architecture (e.g., repetitive sequences) may confuse assembly algorithms and result in gaps in sequence contiguity that ultimately lead to a poorly assembled genome or no assembly at all. For example,
Although the length of sequencing reads continues to increase, repetitive sequences are still the main barrier to the goal of an uninterrupted consensus sequence. It is well known that no graph-based assembly method completely resolves repeat structure. Both graph-based approaches, De Bruijn and Overlap-Layout-Consensus, will either exclude repetitive sequence by truncating the assembly when certain repeat types are encountered or collapse distinct repeat copies into a single representation (Figure 2). This leaves gaps in the sequence assembly and collapses long repeat sequences. Some of the gaps can be closed by using properly oriented paired-end reads with long insertion sizes, such as those from bacterial artificial chromosome or P1-derived artificial chromosome clones. However, in most cases, such long-insert resources are not available. During scaffold assembly of
4.2. Long sequencing reads are a possible solution to assembly issues
Since repetitive sequences are the major cause of gaps in sequence assemblies, one way to maximize assembly contiguity is to employ long reads that are capable of spanning entire repetitive regions. The Pacific Biosciences (PacBio, www.pacificbiosciences.com) P6-C4 sequencing platform now offers the longest sequencing reads in the field, with a maximum read length of 40 kbp and an average length of ~10 kbp (Figure 3).
Because PacBio long sequencing reads are capable of spanning repeat regions, gaps are less likely to be present in the assembled genome. In several recent aquatic genome-sequencing projects, the incorporation of PacBio sequencing technology in concert with very deep Illumina 100 bp paired-end reads (60× coverage) significantly improved the quality of the genome assembly. For example, using 8×–30× PacBio sequence coverage, 62% of gaps could be closed with a 2-fold increase in N50 contig length for the blind cavefish genome build (unpublished data). Similarly, gap filling using long sequencing reads almost tripled the N50 contig length (from 5 kb to 14 kb) for the ice fish genome, but this genome assembly remains plagued with difficult regions that have yet to be resolved (unpublished data).
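Since N50 contig length is the metric used here to report the effect of gap closure, a short sketch of how it is computed may be helpful. N50 is the length of the shortest contig in the set of longest contigs that together cover at least half of the total assembly length. The contig lengths below are hypothetical and are not taken from any of the genome builds mentioned in the text.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one contig with non-zero length")


# Hypothetical contig lengths (kb) before gap closure.
before = [50, 40, 30, 20, 10, 10]

# Closing the gap between the two largest contigs joins them into one.
# (The length of the filled gap itself is ignored for simplicity.)
after = [50 + 40, 30, 20, 10, 10]

print(n50(before), n50(after))
```

The example shows why closing even a small number of gaps between large contigs can raise N50 substantially: the statistic depends only on the largest contigs that make up half of the assembly.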
The use of long sequencing reads to improve current genome builds is not limited to aquatic genome research; it has also been applied to improve the genome quality of other model organisms (e.g., avian models ). For example, the current chicken reference genome has 8106 gaps within scaffolds. After PacBio long sequence reads (10× coverage) were incorporated, 6888 of these gaps (about 85%) were closed, and 6.3 Mb of new sequence was added (unpublished data).
For small genomes (<200 Mb haploid size), long sequencing read technology has advanced to a stage where near complete genomes can be represented. For example, the
In addition to improving current genome assembly quality, long sequencing reads are capable of capturing full-length transcripts, thus facilitating gene expression analyses and transcriptome assembly. Current RNA-Seq approaches apply short reads (50 bp single-end to 125 bp paired-end, depending on the experimental design) to fragmented cDNA libraries. These short reads are then aligned to either a reference genome or a set of reference transcripts for statistical analysis of gene expression. Uniquely aligned short reads provide solid evidence of the expression levels of the aligned genes. However, inappropriate treatment of ambiguously aligned reads can lead to biased or even mistaken expression profiles in complicated vertebrate genomes (e.g., the zebrafish and human genomes). This problem severely affects the discovery of transcript variants, such as alternative splicing events, and the estimation of the relative expression of alternative splicing isoforms, which play significant roles in pathological processes (e.g., Bcl11b1). Quantification of alternative splicing isoform expression relies heavily on the distribution of short reads across each exon; thus, low-coverage splicing isoforms cannot be distinguished. The PacBio long-read sequencing platform can eliminate this problem by providing reads long enough to cover all connected exons in a single read, thus avoiding mistakes in assigning reads to particular exons.
5. Perspectives in aquatic genome research
The availability of aquatic genome models in the past few years has significantly expanded the resources for biological and biomedical discovery. However, as detailed above, problems persist in the current aquatic model draft assemblies (i.e., gaps within and between scaffolds, and repetitive sequences). Over the next few years, there should be a concerted effort to (a)
5.1. De novo genome assembly using long sequencing reads
In Table 6, we show estimated sequence gaps missing from within scaffolds. It is estimated that 2–5% of each genome is not sequenced or assembled outside of scaffold gaps (unpublished result). Previous efforts to close gaps in the assemblies of other species' genomes have shown that structurally variant alleles, simple tandem repeats, and high GC content regions account for the majority of these gaps. The new PacBio sequencing technology, if used to produce high coverage (at least 60×), may be expected to overcome many of these assembly problems and should result in better-represented genome models. Assembling genomes from PacBio sequencing reads requires special treatment of the raw reads, as well as of the sequence assembly process. For example, the multiple-pass raw reads from the circular sequencing template need to be clipped into subreads that represent the DNA fragment. The PacBio sequencing reads also need to be error-corrected using Quiver. Sequence assembly with these very long reads requires different tools from those discussed above. The MinHash Alignment Process (MHAP), which is included in the Celera Assembler PBcR pipeline, is a reference implementation of a probabilistic sequence overlapping algorithm designed for detecting overlaps between long-read sequence data. It is therefore a suitable tool for sequence assembly that employs long sequencing reads.
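The probabilistic overlap detection mentioned above rests on the MinHash idea: two reads that share many k-mers will, with high probability, also share the minimum value of a hash function applied to their k-mer sets. The following sketch illustrates that principle only. It is not the MHAP implementation, and the k-mer size, number of hash functions, hashing scheme, and simulated error-free reads are all assumptions made for this example.

```python
import hashlib
import random


def kmer_set(seq, k):
    """The set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=12, num_hashes=128):
    """For each hash function, keep only the smallest hash over all k-mers."""
    kmers = kmer_set(seq, k)
    return [min(hash_kmer(kmer, seed) for kmer in kmers) for seed in range(num_hashes)]


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of matching minima, an estimate of k-mer set similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Simulated data: two reads overlapping by 150 bp, and one unrelated read.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(600))
read_a = genome[0:300]
read_b = genome[150:450]
unrelated = "".join(rng.choice("ACGT") for _ in range(300))

sketch_a = minhash_sketch(read_a)
overlap_score = estimated_jaccard(sketch_a, minhash_sketch(read_b))
unrelated_score = estimated_jaccard(sketch_a, minhash_sketch(unrelated))
print(overlap_score, unrelated_score)
```

Comparing fixed-size sketches instead of full reads is what makes all-against-all overlap detection tractable for long reads; candidate pairs with a high score would then be examined in more detail. Real long reads also contain sequencing errors, which this sketch does not model.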
During the process of
5.2. Chromosome level aquatic genome assembly
Accurate chromosome assemblies require correctly ordered contigs in scaffolds for gene functional interpretation. During chromosome construction, the placement and order of scaffolds on chromosomes rely on a genetic map, which is based on meiotic recombination. Among the aquatic genome models created in the past few years, the
Recently, new optical mapping technology has been provided by BioNano (http://www.bionanogenomics.com). Optical mapping improves the process of constructing a whole-genome physical map. In this process, high molecular weight genomic DNA is immobilized onto the positively charged glass surface of a chip-like device with engraved nano-channels that are only wide enough to stretch a single DNA molecule. Buffer fluid that flows through the channel stretches a single DNA molecule to maintain its orientation and integrity. The DNA molecules are subsequently cleaved by a restriction enzyme into fragments that are stained with fluorescent dye. An imaging system then measures the fluorescent light intensity, which represents the length of each DNA fragment. Together with the restriction enzyme site sequence, the lengths of the fragments are linked to form a single-molecule optical restriction map.
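An optical restriction map is essentially an ordered list of fragment lengths, and the same kind of list can be predicted from an assembled sequence by locating the enzyme recognition site in silico. The sketch below illustrates this under stated assumptions: the recognition site (GAATTC, the EcoRI site, cut after the first base) and the short test sequence are chosen purely for illustration and are not taken from the protocol described above.

```python
def cut_positions(sequence, site, cut_offset=0):
    """Positions at which the enzyme cuts, given its recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site, cut_offset=0):
    """Ordered fragment lengths produced by a complete in silico digest."""
    cuts = cut_positions(sequence, site, cut_offset)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:]) if end > start]


# Illustrative sequence with two recognition sites.
SITE = "GAATTC"  # assumed enzyme site for this sketch
sequence = "AA" + SITE + "CCC" + SITE + "TT"

print(cut_positions(sequence, SITE, cut_offset=1))
print(fragment_lengths(sequence, SITE, cut_offset=1))
```

A predicted list of fragment lengths of this kind can be compared with the measured fragment lengths of an optical map to place and orient a sequence, allowing for measurement error in the observed lengths.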
During chromosome assembly, the scaffold sequences can be converted to
Aquatic models have proven to be as important and useful as other animal models for studying the etiology and progression of human disease. Aquatic models have gained the attention of funding agencies, and the overall research community using aquatic models has grown rapidly. This growth has resulted in the availability of genome and reference transcriptome resources. The aquatic genome models that were constructed in the past few years are available through NCBI or Ensembl, with new updates constantly being made. Although problems persist in the genome assembly of complicated structures, newer sequencing platforms, mapping technologies, and sequence assembly algorithms are expected to rapidly address these problems and soon offer the community much improved resources.