Open access peer-reviewed chapter

The Study of Hepatitis B Virus Using Bioinformatics

By Trevor Graham Bell and Anna Kramvis

Submitted: October 29th 2015; Reviewed: March 14th 2016; Published: July 27th 2016

DOI: 10.5772/63076



Hepatitis refers to the inflammation of the liver. A major cause of hepatitis is the hepatotropic virus, hepatitis B virus (HBV). Annually, more than 786,000 people die as a result of the clinical manifestations of HBV infection, which include cirrhosis and hepatocellular carcinoma. Sequence heterogeneity is a feature of HBV, because the viral-encoded polymerase lacks proof-reading ability. HBV has been classified into nine genotypes, A to I, with a putative 10th genotype, “J,” isolated from a single individual. Comparative analysis of HBV strains from various geographic regions of the world and from different eras can shed light on the origin, evolution, and transmission of HBV and on its response to preventative and treatment measures. Bioinformatics tools and databases have been used to better understand HBV mutations and how they develop, especially in response to antiviral therapy and vaccination. Despite its small genome size of ~3.2 kb, HBV presents several bioinformatic challenges, which include the circular genome, the overlapping open reading frames, and the different genome lengths of the genotypes. Thus, bioinformatics tools and databases have been developed to facilitate the study of HBV.


  • alignments
  • computation
  • databases
  • genotypes
  • phylogenetics

1. Introduction

Primarily, bioinformatics is the use of computational science to study biological and clinical data using statistics, mathematics, and information theory. This field is developing and evolving; thus, the definition cannot be precise. Moreover, the field is broad, ranging from the study of DNA and proteins, to structural biology, drug design and comparative genomics, transcriptomics, proteomics, and metagenomics. The optimization of computational technology is paramount in order to handle, store, manage, and analyze the large volumes of data generated in the last decade. The data include molecular sequencing data of host and pathogen genomes and their associations with demographic and clinical records, laboratory test results, as well as information on treatment. Moreover, bioinformatics can aid in the investigation of virus–host genome and environmental interactions and in the identification of both host and viral biomarkers. This analysis can lead to a better understanding of the clinical manifestations of disease and to the effective design of preventative and treatment measures [1].

In the first section, we describe the unique genomics and molecular biology of hepatitis B virus (HBV). Using illustrative examples, we show how bioinformatics analyses can facilitate the understanding of the origin, evolution, and transmission of HBV and of its response to antiviral agents. Next, we describe the bioinformatics challenges posed by HBV and present the public databases and tools currently available for the study of HBV.


2. Hepatitis B virus

2.1. Hepatitis

Hepatitis refers to the inflammation of the liver. A major cause of hepatitis is the hepatotropic virus, HBV. HBV infection is a public health problem of worldwide importance. Globally, 2 billion people have been exposed to this virus at some stage of their lives, and 240 million are chronic carriers of the virus [2].

This infection can lead to a spectrum of clinical consequences. In the majority of cases, the infection is subclinical and transient, whereas in 25% of cases, it causes self-limited acute hepatitis, which progresses to acute liver failure in 1% of these cases. The virus can persist in 90% of neonates and 5–10% of adults, leading to chronic infection that can progress to either chronic hepatitis or an asymptomatic carrier state. Both of these states can ultimately progress to liver cancer (hepatocellular carcinoma, HCC), with or without the intermediate cirrhotic stage. Annually, more than 786,000 people die as a result of these clinical manifestations of HBV infection [3].

2.2. Prevalence

The prevalence of HBV in a community can be estimated by the proportion of the population who are hepatitis B surface antigen (HBsAg)-positive carriers. HBV prevalence varies widely across the world [3]. The prevalence is low (<1%) in northern Europe, Australia, New Zealand, Canada, and the United States of America. Northern Asia, the Indian subcontinent, parts of Africa, Eastern and south-eastern Europe, and parts of Latin America are areas of intermediate prevalence (1–5%). The high prevalence areas (5–20%) include East and Southeast Asia, the Pacific Islands, and sub-Saharan Africa.

2.3. Classification and structure

HBV, the prototype member of the family Hepadnaviridae, belongs to the genus Orthohepadnavirus. With a diameter of 42 nm and a DNA genome of ~3.2 kilobases (kb), it is the smallest DNA virus infecting man. The genome is circular and partially double stranded. One DNA strand (the minus strand) is complete except for a small nick, and the other (the plus strand) is short and incomplete. The minus strand contains four overlapping open reading frames (ORFs; Figure 1) [4] that represent: (1) the preS/S gene that codes for the envelope proteins, large, middle, and small HBsAgs; (2) the P gene for DNA polymerase/reverse transcriptase (POL); (3) the X gene for the X protein, a key regulator during the natural infection process, which has transcriptional trans-activation activity and is required to initiate and maintain HBV replication [5]; and (4) the precore/core gene that codes for the HBcAg or core protein that forms the capsid and for an additional protein known as HBeAg, which is not incorporated into the virus itself but is expressed on the liver cells and secreted into the serum. Figure 2 illustrates the structure of the hepatitis B virion.

Figure 1.

The genome of hepatitis B virus (HBV). The partially double-stranded DNA (dsDNA) with the complete minus (−) strand and the incomplete plus (+) strand. The four open reading frames (ORFs) are shown: precore/core (preC/C) that encodes the e antigen (HBeAg) and core protein (HBcAg); P for polymerase (reverse transcriptase); PreS1/PreS2/S for surface proteins [three forms of HBsAg, small (S), middle (M), and large (L)]; and X for a transcriptional trans-activator protein.

Figure 2.

Schematic representation of hepatitis B virus (HBV), showing the structure of the virion, composed of a partially double-stranded DNA genome, enclosed by a capsid comprised of HBcAg and surrounded by a lipid envelope containing large (L)-HBsAg, middle (M)-HBsAg, and small (S)-HBsAg. The virus also expresses two non-particulate proteins: X protein and HBeAg.

2.4. Regulatory elements of HBV

Every nucleotide of the HBV genome is part of a protein-coding region and may also be part of one of the regulatory elements of HBV, which overlap with the protein-coding regions. The regulatory elements include the S1 and S2 promoters, which overlap both the preS region and polymerase ORFs; the preC/pregenomic promoter, which includes the basic core promoter (BCP) and overlaps the X and preC ORFs; and the X promoter. There are two enhancers (enhancer I and enhancer II) as well as cis-acting negative regulatory elements (URR: upper regulatory region, CURS: core upstream regulatory sequence, NRE: negative regulatory element). These regulatory elements control transcription (reviewed in [6, 7]).

2.5. Replication of HBV

HBV and other members of the family Hepadnaviridae have an unusual replication cycle. These DNA viruses replicate by reverse transcription of an RNA intermediate known as the pregenomic RNA (pgRNA) [8]. Entry into the cell is via the sodium taurocholate cotransporting polypeptide (NTCP), a multiple transmembrane transporter predominantly expressed in the liver [9]. After entry, the virion is uncoated and the core particle is actively transported to the nucleus [10], where the partially double-stranded relaxed circular DNA molecule is released. The single-stranded gap is closed by the viral polymerase to yield a covalently closed circular molecule of DNA (cccDNA) [11], which is the template for transcription by the host RNA polymerase II [12]. The mRNAs are transported into the cytoplasm where they are translated into the seven viral proteins. In addition to being translated into the polymerase and the core protein, the pgRNA is packaged into immature core particles by the process known as encapsidation. In order to be encapsidated, the 5′ end of the pgRNA has to be folded into a particular secondary structure known as the encapsidation signal (ε) [13].

The encapsidation signal (ε) is a bipartite stem-loop structure, consisting of an upper and lower stem, the bulge, and an apical loop. Besides encapsidation, ε has a number of other functions (reviewed in [13] and references therein). It acts in template restriction, so that not just any piece of RNA is encapsidated, and it also plays a role in the activation of the viral polymerase, so that there is no indiscriminate reverse transcription. It is also involved in the initiation of reverse transcription. The polymerase or reverse transcriptase acts as a primer of RNA-directed DNA synthesis by binding to the bulge of ε. The first three nucleotides of the negative strand of DNA are synthesized at the bulge and are transferred to an acceptor site on the 3′ end of the pgRNA, where DNA synthesis proceeds toward the 5′ end of the pgRNA [14], giving rise to the immature virion. The virus matures by acquiring its glycoprotein envelope, containing HBsAg, in the endoplasmic reticulum and is exported by vesicular transport from the cell [15].

2.6. Genotypes and subgenotypes of HBV

Sequence heterogeneity is a feature of HBV, because the viral-encoded polymerase lacks proof-reading ability as mentioned above [16]. Using phylogenetic analysis of the complete genome of HBV and an intergroup divergence of greater than 7.5%, HBV has been classified into nine genotypes, A to I [17, 18, 19], with a putative 10th genotype, “J,” isolated from a single individual [20]. With between ~4 and ~8% intergroup nucleotide difference across the complete genome and good bootstrap support, genotypes A–D, F, H, and I are classified further into at least 35 subgenotypes [21]. The genotypes differ in genome length, the size of ORFs and the proteins translated [17], as well as the development of various mutations [22]. Generally, the genotypes, and in some cases the subgenotypes, have a distinct geographic distribution (Table 1).

| Genotype | Length | Differentiating features | Subgenotypes | Geographic distribution | Serological subtype | Transmission route |
|---|---|---|---|---|---|---|
| A | 3221 | 6-nucleotide insert at carboxyl end of core gene | A1 | Africa# | adw2/ayw2 | Horizontal: parenteral or sexual |
| | | | A2 | Europe/North America | adw2 | |
| | | | subgenotype A3 (A3, A4, A5)§ | Africa, Haiti | ayw1 | |
| B | | | subgenotype B3 (B3, B5, B7–B9, B6 (China))§ | | | |
| C | | | Quasi-subgenotype C2 (C2, C14, undefined sequences)§ | Japan/China/Korea | adr | |
| | | | C3 | New Caledonia/Polynesia | adr | Perinatal |
| D | 3182 | 33-nucleotide deletion at the amino terminus of the preS1 region | D1 | Middle East, Central Asia | ayw2 | |
| | | | D4 | Australian aborigines, Micronesians, Papua New Guineans, Arctic Denes | ayw2 | Horizontal: parenteral with intravenous drug use being a risk factor |
| E | 3212 | 3-nucleotide deletion at the amino terminus of the preS1 region | | Western/Central Africa | ayw4 | Horizontal |
| F | 3215 | | F1 | Argentina/Costa Rica/El Salvador, Alaska | adw4 | |

Table 1.

Comparison of the virological and clinical characteristics of the genotypes and subgenotypes of HBV.

Summarizes data compiled from Kramvis [21] and references cited therein.

§Earlier subgenotype designation.

*Rare serological subtype for that genotype.

#And in regions outside Africa where there was historical forced migration as a result of the slave trade [23].

¥Vietnamese residing in Canada [24].

2.7. Genotyping and subgenotyping methods

HBV genotypes, and in some cases subgenotypes and various mutations, can influence the clinical course of disease [22] as well as the response to antiviral therapy [25], and can be used to show transmission [26] and to trace human migrations [23]. Thus, HBV genotyping is becoming increasingly relevant in the clinical setting, may contribute to future personalized treatment [27], and may be important in epidemiological and transmission studies. Bioinformatics has played a major role in the development of tools for identifying genotypes and subgenotypes and for detecting mutations, and a number of such methods have been developed [28, 29].

Although analysis of the HBV S gene sequence is sufficient to classify HBV into genotypes [30], the complete genome sequence provides additional information with respect to phylogenetic relatedness [31, 32], including the identification of recombinants. However, even though complete genome analysis is the gold standard for genotyping, it does not allow for rapid and direct analysis on a large scale [17] and requires expertise, and thus capacity development, in computer processing coupled with phylogenetic analyses. In order to expedite and facilitate genotyping, a number of methods have been developed [17, 28, 29]. Each one has its advantages and disadvantages [17, 28, 29], which should be taken into account when selecting the genotyping method appropriate for a particular study or application.

2.8. Phylogenetic analyses of HBV

Although, as already mentioned, the error-prone polymerase of HBV leads to sequence heterogeneity [16], the degree to which this can occur is constrained by the partially overlapping ORFs and the presence of secondary RNA structures, such as ε, coded by non-overlapping regions [33, 34]. The HBV genome has been estimated to evolve with an error rate of ~10⁻³ to 10⁻⁶ nucleotide substitutions/site/year [35–41], although this rate is not constant within the different regions of the HBV genome [41]. The progress of computers and information technology has played an important role in the development of phylogenetic analysis as a powerful tool in the analysis of the molecular evolution of viruses.
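To give a feel for what the quoted range of substitution rates implies at the level of a whole genome, the following minimal Python sketch multiplies a per-site rate by a genome length of ~3.2 kb. The function name is hypothetical, and the sketch assumes a single uniform rate across all sites, which is a simplification because the rate is reported to vary between regions of the genome.

```python
GENOME_LENGTH = 3200  # approximate HBV genome length in nucleotides (~3.2 kb)


def substitutions_per_genome_per_year(rate_per_site, genome_length=GENOME_LENGTH):
    """Expected substitutions per genome per year, assuming one uniform per-site rate."""
    return rate_per_site * genome_length


# The two ends of the range quoted in the text (substitutions/site/year).
for rate in (1e-3, 1e-6):
    expected = substitutions_per_genome_per_year(rate)
    print(f"{rate:.0e} substitutions/site/year -> {expected:.4f} substitutions/genome/year")
```

Under these assumptions, the upper end of the range corresponds to roughly three substitutions per genome per year, and the lower end to roughly one substitution every 300 years, which illustrates how strongly conclusions about the age of HBV depend on the rate chosen.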

As exemplified in the next sections, comparative analysis of HBV strains from various geographic regions of the world and from different eras can shed light on the origin, evolution, transmission, and response to anti-HBV preventative and treatment measures.

2.9. Origin

The origin and age of the family Hepadnaviridae remain controversial. However, until the issues with the estimation of the substitution rate of HBV [41] are overcome, the debate on the origin of HBV will continue ([17, 41] and references cited therein). Nonetheless, bioinformatics, coupled with a growing number of hepadnaviral sequences in the databases, with accurate sampling times, and advances in phylogenetic and coalescent methodology [42], is beginning to shed light on this issue. For example, according to Suh and colleagues [43], analysis of the endogenous sequences in the zebra finch provides direct evidence that the compact genomic organization of hepadnaviruses has not changed during the last >82 million years of hepadnaviral evolution. Furthermore, phylogenetic analyses and distribution of HBV relics suggest that birds potentially are the ancestral hosts of the family Hepadnaviridae and that mammalian hepatitis B viruses probably emerged after a bird–mammal host switch [43].

2.10. Evolution

Genetic variation is important in viral evolution. The sequence heterogeneity displayed by HBV because of the lack of proof-reading ability of the polymerase is limited by functional constraints [33], leading to non-random variation [44]. Moreover, mutations can be affected by host–virus interaction and selective pressure, imposed endogenously by the immune system and exogenously by vaccination and antiviral treatment [17]. Phenotypic resistance to antiviral drugs occurs because of mutations in the reverse transcriptase of POL, whereas mutations in the BCP/preC and preS regions have been implicated as risk factors for the development of HCC. Mutations in the S region coding for HBsAg can lead to both vaccine and detection escape of HBV. At any time, the virus population can be composed of a number of different mutants referred to as “quasispecies” [45]. Direct sequencing and, more recently, next generation sequencing (NGS), in parallel with bioinformatics, provide us with powerful tools to study the evolution of the various HBV mutations. NGS or ultra-deep sequencing generates large volumes of data, which can only be analyzed using bioinformatics tools, and provides deep coverage that can detect minor quasispecies populations of HBV [46–51] that may be important in understanding HBV pathogenicity and response to treatment. In order to minimize the number of artifactual calls of single-nucleotide variations in NGS, it is important that the correct reference sequences are used [51, 52].

By designing a circular construct, Homs and co-workers [53] were able to use NGS to study evolution of both the precore and polymerase regions. They demonstrated the presence of precore mutants in the HBeAg-positive phase, wild-type precore in the HBeAg-negative phase, as well as lamivudine-resistant strains in treatment-naïve patients. This demonstrates that viral strains occurring at low frequencies can act as reservoirs or memory genomes, which are selected and evolve in response to both intrinsic (host immune response) and extrinsic (drug administration) factors.

2.11. Transmission and tracing human migrations

Sequencing and bioinformatics have played an important role in demonstrating transmission routes, for which previous evidence could only be anecdotal. For example, molecular characterization of HBV together with phylogenetic analysis was used to demonstrate inter-spousal transmission of HBV, even after long marriages, in two Japanese patients who developed acute liver failure [54]. Similarly, the first known case of transfusion-transmitted HBV infection by blood screened using individual donor nucleic acid testing was confirmed by the 99.7% sequence homology between the complete genome sequences of the donor and the recipient HBV strains [26]. When migration events were estimated by ancestral state reconstruction using the criterion of parsimony, it was shown that Africa was the most probable source of dispersal of subgenotype A1 of HBV globally and that its dispersal to Asia and Latin America occurred as a result of the slave and trade routes [23, 55].

2.12. Treatment response and resistance to treatment

According to international chronic hepatitis B treatment guidelines, the most desirable endpoint of treatment is HBsAg loss. Following HBsAg loss, patients have better clinical outcomes, including a decreased risk of developing cirrhosis and HCC, and of death [56]. However, the currently available treatments, which include either nucleos(t)ide analogues (NAs) for direct inhibition of the viral polymerase or pegylated interferon (PegIFN) for immune-mediated HBV control, generally achieve HBV DNA suppression and HBeAg loss only, which are not enduring. In an attempt to identify viral factors associated with HBsAg loss, Charuworn et al. [57] demonstrated that viral diversity could differentiate those patients who would lose HBsAg when treated with tenofovir disoproxil fumarate. Lower diversity was seen in the protein-encoding regions of HBV from patients who lost HBsAg compared to those who did not. On the other hand, higher diversity in regulatory elements of HBV was found to be a predictor of HBsAg loss [57]. These findings need to be confirmed by studies incorporating larger numbers of patients, as well as genotypes other than A and D.

The high mutation rate of HBV means that it can evolve to develop resistance against NAs that target the viral DNA polymerase. Drug-resistant mutants develop under drug pressure in order for HBV to survive in the presence of the NA. The development of drug resistance mutations can be affected by HBV DNA levels at baseline, rate of viral suppression, length of NA treatment, and prior exposure to NA treatment [58]. Sequential treatment with different NAs, following drug failure, can lead to the development of multidrug resistance, which cannot be treated using currently available drugs [59]. The most frequent lamivudine drug resistance mutants are rtM204V/I, which are also selected by the L-pyrimidine analogues emtricitabine, clevudine, and telbivudine but are susceptible to the purine analogues adefovir and tenofovir [59]. rtA181V develops following lamivudine treatment but is sensitive to other NAs, whereas rtN236T is resistant to adefovir only. In deciding on treatment options, the detection of genotypic resistance, which is defined as the detection of viral mutations conferring drug resistance, is a priority in clinics. Direct sequencing and NGS of the polymerase region of the HBV genome can detect both well-defined and novel mutations.

Bioinformatics tools and databases have been used to better understand HBV mutations and how they develop, especially in response to antiviral therapy and vaccination. Although laboratory methods have been used to study mutations, they are labor intensive, expensive, and limited in the degree of complexity they can investigate. As a more economical alternative, bioinformatics and computer simulation can use available biological data, such as the protein sequence and structural information, to investigate interactions between virus, host, and the environment [60]. Thus, Shen et al. showed that most mutations develop in the hydrophobic regions of HBsAg and POL and that the amino acids that are more likely to be mutated are serine and threonine [60]. Understanding how amino acid mutations develop in HBV proteins can facilitate the rational design of both vaccines and drugs for the prevention and treatment of HBV infection, respectively [60]. Using bioinformatics to compare viral and host genomic patterns, together with clinical information, with data from databases can lead to enhanced and individualized antiviral therapy.

3. Bioinformatics tools and databases

3.1. Bioinformatics challenges of HBV

Despite its small genome size of ~3.2 kb, HBV presents several bioinformatic challenges:

  1. The genome is circular, with position 1 conventionally taken to be the first “T” nucleotide in the EcoRI restriction site (“GAATTC”). Historically, position 1 was the start of the “Core” region, which is position 1901 in the current numbering system. Therefore, a number of sequences deposited earlier in the public databases are numbered using this outdated system and thus require processing before they can be used in alignments together with more recently submitted sequences.

  2. Four overlapping reading frames are encoded in the circular genome, whereas nucleotides or amino acids are sequenced and processed linearly. Extracting nucleotide or amino acid sequences for the S and POL ORFs, which span the EcoRI site, from full-length or subgenomic fragments, requires additional processing.

  3. The differences in genome lengths between the nine HBV genotypes (ranging from 3182 to 3248 base pairs in length) mean that direct comparison of loci between genotypes is not always possible using the current numbering system. These differences in genome lengths result in genotype alignments containing several regions of gaps, ranging from 3 to 33 nucleotides in length. A possible solution is the implementation of a standardized “universal numbering system” for all HBV genotypes, which we are currently developing.

  4. Sequence variability is a feature of HBV. It is, therefore, essential to check all sequences carefully, to distinguish between artifacts and true variation (mutations). Variation within a population at a locus may result in two overlapping peaks on a chromatogram. Super-infections or co-infections with different strains may result in mixed populations, which appear as multiple or misaligned peaks on sequencing chromatograms. Disambiguating these is essential for robust downstream analyses.
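The first two challenges above can be illustrated with a minimal Python sketch. It rotates a circular sequence so that position 1 is the first “T” of the EcoRI site, converts a position from the older core-based numbering to the current numbering, and extracts a region that wraps around position 1. The function names and the toy sequence are hypothetical, the offset of 1900 is derived only from the statement that old position 1 corresponds to current position 1901, and the sketch assumes that the sequence is complete and contains the EcoRI site.

```python
ECORI_SITE = "GAATTC"


def rotate_to_ecori(genome):
    """Rotate a circular genome so that position 1 is the first T of GAATTC."""
    seq = genome.upper()
    # Append the first few bases so a site spanning the linear ends is also found.
    idx = (seq + seq[: len(ECORI_SITE) - 1]).find(ECORI_SITE)
    if idx == -1:
        raise ValueError("no EcoRI site found; cannot determine position 1")
    start = (idx + 3) % len(seq)  # 0-based index of the first T in GAATTC
    return seq[start:] + seq[:start]


def old_to_current(old_pos, genome_length):
    """Map a core-based position (old position 1 = current 1901) to current numbering."""
    return (old_pos - 1 + 1900) % genome_length + 1


def extract_circular(genome, start, end):
    """Return the 1-based, inclusive region start..end; wraps past position 1 if start > end."""
    if start <= end:
        return genome[start - 1 : end]
    return genome[start - 1 :] + genome[:end]


# Toy 12-nucleotide "genome" used purely for illustration.
toy = "CCGAATTCAAGG"
rotated = rotate_to_ecori(toy)
print(rotated)                           # sequence renumbered from the EcoRI site
print(extract_circular(rotated, 10, 3))  # region spanning the origin
print(old_to_current(1, 3221))           # old core-based position 1 in a 3221-nt genome
```

A real implementation would also need to handle sequences that lack the EcoRI site, subgenomic fragments, and reverse-complemented submissions, none of which is attempted here.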

3.2. Public sequence databases

The first public sequence database, “GenBank,” was established in 1982, having arisen from the earlier Los Alamos database, established in 1979 [61, 62]. Since then, the number of nucleotides in GenBank has doubled approximately every 18 months [63]. The International Nucleotide Sequence Database Collaboration (INSDC) is a collection of three publicly available nucleotide (DNA or RNA) sequence databases, which synchronize data daily [64]. The collection consists of the DNA DataBank of Japan (DDBJ, located in Japan), the European Molecular Biology Laboratory (EMBL, located in the United Kingdom) and GenBank (located in the United States of America). The latest release of the database (release 211.0, from 15 December, 2015; [65]) contains 189,232,925 loci and 203,939,111,071 bases, from 189,232,925 sequences, totaling approximately 742 gigabytes. In addition to the INSDC, many other databases exist, including genome databases, protein sequence, structure and interaction databases, microarray databases, and meta-databases. A list of biological databases on Wikipedia includes over 200 entries [66].

When searching for “hepatitis b virus” across all fields, the GenBank database [63], accessed on 27th January 2016, contained 105,745 sequences. When searching for “hepatitis b virus” in the “organism” field only, 84,119 sequences were found, with the oldest sequence submitted in the early 1980s. Refining this search to include only sequences of 200 nucleotides or longer, and excluding words such as “recombinant,” “clone,” and “patent,” resulted in 68,762 sequences. When this same query was previously executed on 29 November 2015, 67,893 sequences were returned. Therefore, in the 59 days between the two queries, 869 new sequences (of at least 200 nucleotides in length, and not containing the words mentioned previously) were uploaded to GenBank. On average, this equates to almost 15 new HBV sequences added to GenBank per day.
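The arithmetic behind this estimate can be checked with a few lines of Python. The dates and sequence counts are those given in the text; the variable names are chosen here for illustration only.

```python
from datetime import date

# Dates of the two GenBank queries and the number of sequences each returned.
first_query = date(2015, 11, 29)
second_query = date(2016, 1, 27)
sequences_first = 67_893
sequences_second = 68_762

days_between = (second_query - first_query).days
new_sequences = sequences_second - sequences_first
per_day = new_sequences / days_between

print(f"{new_sequences} new sequences in {days_between} days = {per_day:.1f} per day")
```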

Making use of these sequences in downstream applications, such as multiple sequence alignments or phylogenetic analyses, is often challenging, as it is difficult to query for sufficient sequences, of the correct genotype, or subgenotype, and covering the required genomic region. In order to overcome this limitation, we have developed a bioinformatics solution, whereby all sequences matching a query are downloaded, curated, and aligned. The algorithm developed allows for the generation of a multiple sequence alignment for each genotype, which contains all the available sequences matching the query and in their correct position and orientation [67].

3.3. Bioinformatics tools for HBV

| Workflow | Tool name | Description |
|---|---|---|
| Chromatograms | Quality score analyzer | Plots chromatogram quality scores |
| | Automatic contig generator tool | Generates a contig from a forward and reverse chromatogram |
| Alignment | Automatic alignment clean-up tool | Eliminates “gap-columns” and disambiguates ambiguous bases |
| | Mind the gap | Splits FASTA file based on gap threshold per column |
| | | Extracts HBV protein sequences (ORFs) |
| | Wild-type 2 × 2 | Calculates 2 × 2 wild-type/mutant contingency tables |
| | Divergence calculator* | Intra- and inter-group divergence with custom groups |
| | | Generates random subsets from an input FASTA file |
| Serotyping | HBV serotyper tool | Determines HBV serotype |
| Phylogenetics | Pipeline: TreeMail | Generates a phylogenetic tree |
| GenBank submission | PadSeq | Places two HBV sequence fragments on a backbone template |

Table 2.

List of the online tools developed and the workflow process at which each would be used.

Table modified from Bell and Kramvis [68].

*Described for the first time here.

A standard molecular biology laboratory workflow includes DNA extraction, polymerase chain reaction (PCR) amplification, direct DNA sequencing, viewing and checking of chromatograms, preparation of curated sequences, multiple sequence alignment, sequence analysis, serotyping, genotyping, phylogenetic analysis, and preparation of sequences for submission to the GenBank public sequence database [68]. Each of these steps presents data processing challenges, many of which have been addressed by the development of a suite of online tools (Table 2) [68].

| Resource and description | Reference |
|---|---|
| Geno2Pheno [hbv] resistance mutations, escape mutant analysis | [69] |
| HBV Blast Search, drug resistance database | [70] |
| HBVRegDB, alignments, information about conserved regions | [72] |
| HBVdb, annotation, drug resistance database | [73] |
| Hepatitis virus database, sequence alignment and map viewing | |
| Hepatitis virus database | |
| NCBI genotyping tool | [74] |
| Oxford HBV subtyping tool | [70] |
| SeqHepB analysis, genotyping, detection of clinically important mutations | [75] |

Table 3.

Currently available HBV websites and databases.

Table modified from [67].

Stand-alone, web-based tools can be accessed from any operating system and from any location with an internet connection. There is no requirement to install and learn new bioinformatics software, as these tools can be used when required. A system for processing ultra-deep pyrosequencing (amplicon resequencing) data has also been developed [51]. In addition, a number of HBV-specific websites and databases are currently available, a selection of which is presented in Table 3.

3.4. New bioinformatics tools for HBV

Here, we present two newly developed tools for the bioinformatic analysis of HBV.

3.4.1. Divergence calculator

One method of classifying HBV sequences into genotype or subgenotype is to examine nucleotide sequence divergence between sequences. This divergence calculation is performed by totaling the number of nucleotides that differ between two aligned sequences and computing the percentage difference. The divergence calculator (Figure 3) performs various divergence calculations on groups of sequences from nucleotide or amino acid multiple sequence alignments in FASTA format. A minimum of one group containing two sequences, or two groups containing one sequence each, must be specified.

Figure 3.

The input screen of the divergence calculator in which sequences are extracted and allocated to groups and other parameters specified.

As an example, consider an alignment of 10 genotype A sequences (group 1) and 10 genotype D sequences (group 2). Intra-group divergence, for each group, is calculated by comparing each sequence in group 1 with each other sequence in group 1 and then calculating the median, mean, and standard deviation of the divergences. This is then repeated for group 2. The inter-group divergence compares each sequence in group 1 with each sequence in group 2, and then calculates the median, mean, and standard deviation. If more than two groups are specified, the calculations iterate over all groups in turn.

If the optional “query” group is specified, the tool compares each sequence in the query group with each sequence in the other group or groups, but outputs statistics for each sequence in the query group individually. This method would typically be used with a set of unknown query sequences and one or more groups of reference sequences. A comprehensive list of descriptive statistics is included on the output page for each analysis.
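The calculations described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of the divergence calculator itself: the function names and toy sequences are hypothetical, every alignment column is counted (a gap is treated like any other character), and the standard deviation is reported as 0.0 when only one pairwise comparison exists.

```python
from itertools import combinations, product
from statistics import mean, median, stdev


def divergence(a, b):
    """Percentage of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * diffs / len(a)


def summarize(values):
    """Median, mean, and standard deviation of a set of divergences."""
    values = list(values)
    sd = stdev(values) if len(values) > 1 else 0.0
    return {"median": median(values), "mean": mean(values), "sd": sd}


def intra_group(group):
    """Compare each sequence with each other sequence in the same group."""
    return summarize(divergence(a, b) for a, b in combinations(group, 2))


def inter_group(group1, group2):
    """Compare each sequence in group 1 with each sequence in group 2."""
    return summarize(divergence(a, b) for a, b in product(group1, group2))


def query_report(queries, references):
    """Per-query statistics against each reference group (the optional query mode)."""
    return {
        name: {g: summarize(divergence(seq, ref) for ref in refs)
               for g, refs in references.items()}
        for name, seq in queries.items()
    }


# Toy aligned sequences, invented for illustration only.
group_a = ["ACGTACGTAC", "ACGTACGTAT"]
group_d = ["ACGAACGAAC", "ACGAACGAAT"]

print("intra A:", intra_group(group_a))
print("intra D:", intra_group(group_d))
print("inter A-D:", inter_group(group_a, group_d))
print(query_report({"unknown": "ACGAACGAAC"}, {"A": group_a, "D": group_d}))
```

In the query mode, an unknown sequence would typically be assigned to the reference group from which it shows the lowest divergence, subject to the divergence thresholds used to define genotypes and subgenotypes.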

3.4.2. Random FASTA extraction and allocation (RAFAEL)

In some analyses, particularly when constructing phylogenetic trees, it may be desirable to extract one or more random subsets of sequences from a master or reference alignment. The “RAFAEL” tool was designed to perform this task. This tool takes an input file in FASTA format, which does not have to be aligned, and generates one or more subsets of the file, each containing a random selection of the specified number of sequences. The number of sequences may be specified as a count, or as a percentage of the number of sequences in the input file. There are guaranteed to be no duplicate sequences within each subset. However, the same sequence may appear in more than one subset, as the subsets are not mutually exclusive.
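The behavior described above can be approximated with the following minimal Python sketch. It is not the RAFAEL implementation: the function names, the `seed` parameter, and the rounding applied to percentages are assumptions made for illustration.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def random_subsets(records, n_subsets, count=None, percent=None, seed=None):
    """Draw n_subsets random subsets, sized by a count or by a percentage of the input."""
    if (count is None) == (percent is None):
        raise ValueError("specify exactly one of count or percent")
    if count is None:
        count = max(1, round(len(records) * percent / 100))
    rng = random.Random(seed)
    # sample() draws without replacement, so no record repeats within a subset;
    # each subset is drawn separately, so a record may occur in several subsets.
    return [rng.sample(records, count) for _ in range(n_subsets)]


# Toy FASTA input, invented for illustration only.
fasta = ">s1\nACGT\n>s2\nACGA\n>s3\nACGC\n>s4\nACGG\n>s5\nACTT\n"
records = read_fasta(fasta)
subsets = random_subsets(records, n_subsets=3, count=2, seed=1)
for i, subset in enumerate(subsets, start=1):
    print(i, [header for header, _ in subset])
```

Fixing the seed makes the random selection reproducible, which may be useful when the same subsets need to be regenerated for a later analysis.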

3.5. Open-source software

In addition to biological databases, a large variety of biological analysis software, which is generally genome agnostic, is available. As with software in any field, the licensing terms and commercial costs of these packages vary widely. For example, a package that is free of cost is not necessarily open-source.

The Free Software Foundation (FSF) [76, 77] defines free software as software which “respects the users’ freedom” in the sense that “users have the freedom to run, copy, distribute, study, change, and improve the software”. As such, “free” is “a matter of liberty, not price”. Free software, therefore, does not necessarily have to be made available at no cost or be a non-commercial project. Conversely, software that is provided at no cost may not be “free” in the sense described above.

The term “open-source” is often used when referring to “free” software. However, the two terms are not synonymous, although there is some overlap. Open-source software may, or may not, be free software, depending on the restrictions placed on its users. If the user is not free to distribute, change, and improve the software, even if it is open source, then it cannot be considered to be free software. Most software for which a license must be purchased is neither free nor open source: the user does not have the freedom to distribute the software, or to use it on any computer of their choosing.

3.6. Recommended software

A list of recommended software, which is freely available for download, is presented in Table 4. Comprehensive lists of open-source bioinformatics software can be found elsewhere [78].

| Software name | Description | Website (http://) | Lin* | Mac* | Win* | References |
|---|---|---|---|---|---|---|
| Unipro UGene | Integrated bioinformatics suite | ugene.net | Yes | Yes | Yes | [79] |
| MEGA 6 | Integrated bioinformatics suite; command-line version available | megasoftware.net | Emu | Yes | Yes | [80] |
| BioEdit | Multiple sequence alignment viewer and | | | | | |
| SeaView | Multiple sequence alignment viewer and editor, and molecular phylogenetics; opens GeneDoc “MSF” | | | | | [81] |
| AliView | Multiple sequence alignment viewer and | | | | | [82] |
| GeneDoc | Multiple sequence alignment viewer and editor, and shading | | | | | |
| PHYLIP | Programs for inferring phylogenies; website includes comprehensive list of other phylogenetics | | | | | [83] |
| MrBayes | Bayesian inference and model choice using Markov Chain Monte Carlo | mrbayes.sourceforge.net | Com | Yes | Yes | [84] |
| BEAST | Bayesian analysis of molecular sequences using Markov Chain Monte | | | | | [85] |
| FigTree | Graphical viewer and editor of phylogenetic | | | | | |
| Archaeopteryx | Graphical viewer and editor of phylogenetic | | | | | [86] |
| EMBOSS | A suite of command-line tools for molecular | | | | | |
| JalView | Multiple sequence alignment viewer and editor | www.jalview.org | Yes | Yes | Yes | [87] |
| Finch TV | DNA sequence trace (chromatogram) | | | | | |
| Chromas | DNA sequence trace (chromatogram) | | | | | |

Table 4.

Bioinformatics software available free of charge for various computer operating system platforms.

*“GUI” = graphical user interface, “CL” = command line interface, “OSS” = open-source software, “Lin” = GNU/Linux, “Mac” = Apple Macintosh, “Win” = Microsoft Windows, “Emu” = emulator or virtual machine recommended by authors, “Com” = compilation from source code required.

4. Conclusion

The unique genome structure and molecular biology of HBV pose a number of challenges. The development of bioinformatic tools has therefore facilitated a more comprehensive and detailed analysis and understanding of the origin, evolution, and transmission of HBV, its response to antiviral agents, and its interaction with the host. A wide range of free and commercially available tools has been developed for different applications. The availability and applications of high-throughput sequencing techniques and the advancement of “-omics” will continue to provide additional challenges, which will need to be addressed by further computational solutions.


Trevor Bell is the recipient of a National Research Foundation (NRF) Scarce Skills Post-Doctoral Fellowship (GUN#86215) and Anna Kramvis received funding from the National Research Foundation (GUN#65530, GUN#93516).

© 2016 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Trevor Graham Bell and Anna Kramvis (July 27th 2016). The Study of Hepatitis B Virus Using Bioinformatics, Bioinformatics - Updated Features and Applications, Ibrokhim Y. Abdurakhmonov, IntechOpen, DOI: 10.5772/63076. Available from:
