Mosquito-borne viral diseases are infections transmitted by the bite of infected mosquitoes. The burden of these diseases is highest in tropical and subtropical areas, and they disproportionately affect the poorest populations. Since 2014, major outbreaks of dengue, chikungunya, yellow fever and Zika have afflicted populations and overwhelmed health systems in many countries. The distribution of mosquito-borne diseases is determined by complex demographic, environmental and social factors, which cause diseases to emerge in countries where they were previously unknown. Coupling genomic diagnostics and epidemiology to innovative digital disease detection platforms raises the possibility of an open, global, digital pathogen surveillance system. With pathogen surveillance in mind, real-time sequencing, bioinformatics tools and the combination of genomic and epidemiological data from viral infections can give essential information for understanding the past and the future of an epidemic, making it possible to establish an effective surveillance framework for tracking the spread of infections to other geographic regions.
- mosquito-borne viral diseases
- arboviral infections
- genomic epidemiology
- next-generation sequencing
- genomic surveillance
- viral pathogens
Mosquito-borne viral diseases have lately made worldwide headlines following the emergence of arbovirus outbreaks in large urban areas. According to the World Health Organization, vector-borne diseases represent more than 17% of all infectious diseases registered worldwide, and they account for more than 700,000 deaths annually. Given this scenario of increasing case numbers and expansion to new areas, the spread of infectious diseases was listed second among the top 10 risks in terms of impact in the Global Risks 2015 report.
Mosquitos of the genus
Dengue and chikungunya are two arboviral diseases on the World Health Organization's list of neglected tropical diseases. Neglected tropical diseases are a group of diseases that have received insufficient public attention, thrive in tropical and subtropical areas, and strongly affect populations living in poverty. It is argued that arboviruses can be considered a group of neglected tropical diseases, since they can have a long-lasting impact on the health and economic life of affected populations. Some studies have argued that socioeconomic factors and land-use changes, associated with the effects of climate change and of global travel and trade, modulate the dynamics of expansion of emerging and re-emerging mosquito-borne diseases [17, 18, 19, 20]. Movement of people between neighboring countries has been considered a good predictor of chikungunya spread in the Caribbean and Indian Ocean. The expansion of the geographic distribution of arboviruses has a significant negative impact on public health in many regions of the world. To reduce such impacts, it has been argued that implementing a surveillance system that monitors virus diffusion and the appearance of new genetic variants is relevant to public health. To this end, genomic sequencing data and bioinformatics have been employed in the study of virus evolution, aiming to elucidate phylogenetic relationships and patterns of virus spread during an epidemic.
2. Genomic surveillance
Infectious diseases continue to be one of the leading causes of death worldwide, and pathogens such as viruses can evolve and spread rapidly, leading to the emergence of newly mutated human pathogens, more virulent strains, and antibiotic- and drug-resistant organisms [24, 25]. In this context, genomic surveillance aims to (i) perform global surveillance of pathogens using whole genome sequencing and (ii) understand drug resistance, emergence and spread of viral pathogens. Several approaches have been developed and are widely used for the quick detection and identification of viral pathogens (i.e., diagnostics). Some of them are based on serological and molecular strategies including, for example, assays based on real-time polymerase chain reaction. Even though these approaches offer high sensitivity and specificity for their purpose, they are suitable for diagnostics only and cannot provide detailed genomic information.
Bearing these limitations in mind, the main point of developing new genomic surveillance tools is to answer the following question: which issues that matter for genomic surveillance cannot be addressed by conventional RT-qPCR or serology? (i) RT-qPCR assays do not allow genotype classification, nor do they help identify particular or characteristic transmission routes; (ii) RT-qPCR assays also cannot determine how fast a viral pathogen is being transmitted or in what direction it is spreading; (iii) serological and molecular assays cannot identify epidemiologically linked individuals or predict future outbreaks; and (iv) finally, serological and some molecular approaches cannot identify novel pathogenic agents and are, therefore, unsuitable for pathogen discovery.
Next generation sequencing (NGS) technologies produce significantly more raw data than other molecular diagnostic assays, including Sanger sequencing, and can inform not just pathogen diagnostics but also epidemiology. This is why whole genome sequencing of viral genomes with new technologies plays an important role in the fight against emerging and re-emerging epidemics [29, 30]. The availability of high-throughput sequencing has also provided immense insights into the ecology of health care-associated pathogens. Real-time sequencing of entire pathogen genomes has therefore become a standard and indispensable research tool, given the critical role of genomic surveillance in the prevention and control of emerging infectious diseases, which is why NGS can be considered a powerful strategy that also allows the discovery of novel potential viral pathogens [33, 34].
With pathogen surveillance in mind, bioinformatics tools and the combination of genomic and epidemiological data from viral infections can give essential information for understanding the past and the future of an epidemic. Genomic data generated by real-time sequencing can show how and when viruses were introduced into a particular site, the pattern and determinants of their dissemination to neighboring locations and the extent of their genetic diversity, i.e., their dynamics, making it possible to establish an effective surveillance framework for tracking the spread of infections to other geographic regions [21, 22, 34]. In this context, recently established international networks for real-time, portable genomic sequencing, genomic surveillance and data analysis have made it possible to monitor the evolution of viral genomes, to understand the origins of outbreaks and epidemics, to predict future outbreaks and to help keep diagnostic methods up to date [33, 34, 35]. Additionally, a genomic surveillance framework makes it possible to determine, through genome sequencing, the real-time molecular epidemiology of viruses circulating and co-circulating in different regions of a specific area, and to detect and characterize the early emergence of new pathogens in large urban centers, generating data that can inform outbreak control responses [27, 34]. Data on the molecular, epidemiological, phylogenetic and geographical aspects of viral pathogens circulating in a specific setting contribute to a better understanding of those infections in a national and international context, and play an important role in solving issues relevant to public health. As a result, studies involving more in-depth molecular and dispersion analyses of circulating pathogens may help the World Health Organization adopt appropriate measures to control epidemics and to monitor the dynamics and spread of new viral strains.
However, even though NGS has advantages over routine diagnostics, none of the different strategies and technologies, developed by Illumina, Thermo Scientific, Oxford Nanopore and others, is yet considered a panacea. Remaining challenges include high data throughput, which requires sophisticated computational processing and the annotation of large amounts of sequencing data, and high DNA or RNA input requirements (in some cases hundreds of nanograms), which often call for prior PCR-based amplification. On top of all this, relatively few researchers in the area have sufficient bioinformatics expertise and are able to engage in near-patient or disease surveillance activities.
3. Bioinformatics tools and phylogenetic tools
The advent of next generation sequencing (NGS) and advancements in bioinformatics present an opportunity to tap into new insights that are crucial to the establishment of an open, global digital surveillance system. NGS technologies have enabled the production and deposit of vast numbers of whole genomes into public repositories [36, 37, 38], ushering the field of genomics into the era of big data. This has in turn increased the scale of genomic studies from the analysis of single or few genomes to an ever-increasing number of genomes [39, 40].
Toward the development of a global surveillance system, bioinformatics provides the tools to answer pertinent questions, including the identification of the organisms responsible for an outbreak, the source of an outbreak and the evolutionary information of pathogens that is crucial for understanding unique phenotypes such as drug resistance, virulence and disease outcome.
Several bioinformatics tools and pipelines have been developed to facilitate the processing, analysis and visualization of these data in order to derive useful information from them. The major fields of interest addressed by these tools include comparative genomics, which involves comparing the genetic content of one organism against that of another; prediction of the function of genes and sequences of the coding regions; identification of evolutionary events; and inference of phylogenetic relationships. These fields of study play a critical role in elucidating pathogen evolution, niche adaptation, population structure and host-pathogen interaction. Furthermore, these findings inform vaccine and drug design, as well as the identification of virulence genes.
4. Bioinformatics pipelines and workflows
Bioinformatics pipelines and workflows comprise a series of third-party executable command line programs assembled to perform a specific task or analysis. A complete pipeline will, therefore, be able to support the end-to-end analysis of a given field of study such as phylogenetics or variant detection. Pipelines can thus be broken down into two major components, i.e., the data processing component and the analytical component that performs the core analysis of the pipeline. Below, we review some of the prominent bioinformatics pipelines and workflows that support the processing and analysis of NGS data to provide insights relevant to the global surveillance of arboviral outbreaks.
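The two-component structure described above can be illustrated with a minimal Python sketch. It is not taken from any of the pipelines reviewed here: the in-memory read format, the function names and the quality threshold are all illustrative assumptions, and the "analysis" is deliberately trivial (counting mismatches of each read against the start of a reference).

```python
from typing import Iterable, List, Tuple

# Hypothetical in-memory read format: (bases, Phred+33 quality string).
Read = Tuple[str, str]


def mean_phred(quality: str) -> float:
    """Mean Phred score of a Phred+33 encoded quality string."""
    return sum(ord(ch) - 33 for ch in quality) / len(quality)


def quality_filter(reads: Iterable[Read], min_mean_q: float = 20.0) -> List[str]:
    """Data processing component: keep reads whose mean quality is high enough."""
    return [bases for bases, qual in reads if qual and mean_phred(qual) >= min_mean_q]


def count_mismatches(sequences: Iterable[str], reference: str) -> List[int]:
    """Analytical component (toy): mismatches per read, assuming every read
    starts at position 0 of the reference."""
    return [sum(a != b for a, b in zip(seq, reference)) for seq in sequences]


def run_pipeline(reads: Iterable[Read], reference: str, min_mean_q: float = 20.0) -> List[int]:
    """Chain the processing step and the analysis step."""
    clean = quality_filter(reads, min_mean_q)
    return count_mismatches(clean, reference)
```

In a real workflow each function would typically wrap an external command line tool; the point of the sketch is only that the analytical step consumes the output of the processing step.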
5. Virus discovery and identification tools
Viral discovery and identification from isolates and metagenomic samples present major challenges to bioinformatics in general. This is because of the very high variability of viral genomes and their deviation from reference genomes, the continuous emergence of new viruses with no available references, high intrapopulation diversity, and the relative rareness of viral DNA fragments in metagenomic samples. These challenges have largely been addressed through the following pipelines.
5.1 Genome Detective
Genome Detective (http://www.genomedetective.com/app/) is an easy-to-use web-based application designed to assemble and analyze whole or partial viral genomes quickly and accurately, directly from NGS reads, within minutes. The application gains accuracy by using a novel alignment method that combines amino acid and nucleotide scores to construct genomes by reference-based linking of de novo contigs. Speed and accuracy are also gained by using DIAMOND with a UniRef90 reference dataset to sort viral taxonomy units. The use of DIAMOND and UniRef90 allowed Genome Detective to identify viral short reads at least 1000 times faster than when Blastn and the viral nucleotide database of NCBI were used. The software was optimized using synthetic datasets to represent the great diversity of virus genomes. The application was then validated with next-generation sequencing data of hundreds of viruses.
5.2 VirusTAP: viral genome-targeted assembly pipeline
One of the major difficulties in this process is the correct de novo assembly of viral genomes from crude metagenomic deep sequencing reads, which include large amounts of bacterial and human-related sequencing reads. Such read contamination often overloads the server during de novo assembly and might cause misassembly of the resultant contigs. Pre-filtering by host-mapping subtraction can lead to efficient de novo assembly, allowing a complete viral genome sequence to be obtained rapidly and accurately. In addition to improving the accuracy of de novo assembly, the exclusion of human-related sequences can circumvent ethical conflicts by avoiding the analysis of patients' personal genetic information [46, 47].
VirusTAP is a web-based, integrated NGS analysis tool designed to facilitate rapid and accurate viral genome assembly from raw reads with just a few clicks. Like Genome Detective, it eliminates non-viral reads prior to de novo assembly so that performance is not compromised.
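The idea of subtracting host reads before assembly can be sketched in a few lines of Python. This is a toy stand-in, not the method used by VirusTAP or any other tool named here: real pipelines map reads against a host genome with a dedicated read mapper, whereas this sketch simply discards reads that share most of their k-mers with a host reference. The function names, the k-mer size and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import Iterable, List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """All overlapping substrings of length k (empty set if the sequence is shorter than k)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def subtract_host_reads(
    reads: Iterable[str],
    host_reference: str,
    k: int = 8,
    max_shared_fraction: float = 0.5,
) -> List[str]:
    """Keep reads whose fraction of k-mers found in the host is at most the threshold.

    Reads shorter than k carry no k-mers and are dropped as uninformative.
    """
    host_kmers = kmers(host_reference, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue
        shared_fraction = len(read_kmers & host_kmers) / len(read_kmers)
        if shared_fraction <= max_shared_fraction:
            kept.append(read)
    return kept
```

Only the reads returned by `subtract_host_reads` would then be passed on to de novo assembly, which is what keeps both the computational load and the amount of human genetic information handled to a minimum.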
5.3 Virus identification pipeline (VIP)
VIP (https://github.com/keylabivdc/VIP) is a web-based virus discovery and identification tool. With a single click, it filters out background-related reads, classifies reads on the basis of nucleotide and remote amino acid homology, and performs phylogenetic analysis to provide evolutionary insights.
5.4 TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data
TAR-VIR is a non-reference based NGS analysis tool for the reconstruction of viral strains from metagenomic samples [46, 47]. It was developed to classify RNA viral reads from viral metagenomic data and also to produce the assembled viral strains (i.e. haplotypes) from classified reads. It mainly has two components: (1) viral read classification using partial or remotely related reference genomes and (2) de novo assembly of viral haplotypes from recruited reads with PEHaplo [47, 48], which is a haplotype reconstruction tool. As TAR-VIR has a modular structure, the users have options to use other assembly tools after read classification in step (1).
6. Genotyping tools
While virus discovery and identification tools play a critical role in determining the pathogen responsible for an infection, they are unable to determine the subtype or quasispecies responsible for an outbreak. Arboviruses exist as mixed populations of genomic variants due to rapid replication and the error-prone nature of the viral RNA-dependent RNA polymerase (RdRp). Monitoring virus genotype diversity is therefore crucial to understanding the emergence and spread of outbreaks. Genotyping tools provide an efficient workflow that enables researchers and public health practitioners to determine the strain responsible for an outbreak.
Most free-access bioinformatics programs used to classify viruses into subtypes, genotypes, subgroups or groups rely on similarity search tools to determine the genotype of a new sequence. These genotyping tools use a set of reference genome sequences carefully selected to represent each individual genotype. Using several reference sequences per genotype increases the consistency and reproducibility of the results, speeds up the search, and ensures that classification is not limited by an inadequate reference set that fails to represent the diversity needed to identify the virus.
Similarity-based methods are useful for identifying recombination patterns in viral sequences, but their results need further confirmation by phylogenetic methods and come with no statistical support.
Recently, four viral genotyping tools for yellow fever (YFV) (https://www.genomedetective.com/app/typingtool/yellowfevervirus/), dengue (DENV) (https://www.genomedetective.com/app/typingtool/dengue/), chikungunya (CHIKV) (https://www.genomedetective.com/app/typingtool/chikungunya/) and Zika (ZIKV) (https://www.genomedetective.com/app/typingtool/zika/) were developed and linked to Genome Detective to enable phylogenetic classification below species level [50, 51].
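A similarity-based assignment of this kind can be sketched as follows. The sketch is a simplified illustration, not the algorithm of any specific typing tool: it assumes the query and all references are already aligned to the same length, uses plain percent identity rather than a similarity search program, and the genotype labels are hypothetical. As noted above, the result carries no statistical support.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal aligned length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def assign_genotype(query: str, references: Dict[str, List[str]]) -> Tuple[str, float]:
    """Return the genotype of the most similar reference and its identity.

    `references` maps a genotype label to one or more representative sequences.
    """
    best_genotype, best_identity = "", -1.0
    for genotype, sequences in references.items():
        for reference in sequences:
            identity = percent_identity(query, reference)
            if identity > best_identity:
                best_genotype, best_identity = genotype, identity
    return best_genotype, best_identity
```

Holding several references per genotype, as in the `references` mapping, is what lets a divergent query still find a close representative of its own genotype.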
The classification and annotation of virus genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific, well-studied families of viruses. Thus, viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.
CASTOR is a virus classification platform based on machine learning methods, inspired by a well-known technique in molecular biology: restriction fragment length polymorphism. It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. The performance, genericity and robustness of CASTOR could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.
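The in silico digestion step can be illustrated with a short Python sketch. It is not CASTOR's implementation: the enzyme panel is a small hand-picked example, the genome is treated as linear, and the two features computed per enzyme (fragment count and mean fragment length) are illustrative assumptions that are not necessarily the two metrics CASTOR uses.

```python
import re
from typing import Dict, List, Tuple

# Example enzymes: recognition site and cut offset within the site
# (each of these cuts after the first base of its site).
ENZYMES: Dict[str, Tuple[str, int]] = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def digest(sequence: str, site: str, cut_offset: int) -> List[str]:
    """Cut a linear sequence at every occurrence of a recognition site."""
    # The lookahead finds overlapping occurrences without consuming characters.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[start:end] for start, end in zip(bounds, bounds[1:])]


def feature_vector(sequence: str, enzymes: Dict[str, Tuple[str, int]] = ENZYMES) -> List[float]:
    """Two features per enzyme, in alphabetical enzyme order:
    number of fragments and mean fragment length."""
    features: List[float] = []
    for name in sorted(enzymes):
        site, cut_offset = enzymes[name]
        lengths = [len(fragment) for fragment in digest(sequence, site, cut_offset)]
        features.append(len(lengths))
        features.append(sum(lengths) / len(lengths))
    return features
```

Vectors of this kind, computed for genomes of known class, could then be fed to any standard classifier; the choice of enzymes determines how well the classes separate.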
7. Phylogenetic and phylodynamic tools
Phylogenetic tools are an extremely important resource in virology, used to study viral evolution, trace the origin of epidemics, establish the mode of transmission, investigate the occurrence of drug resistance or determine the origin of the virus in different body compartments. Bioinformatics tools are thus fundamental for monitoring the evolution of viral diversity: they support the genomic sequence analyses that are crucial for the surveillance of viral polymorphism, the development of new therapeutic strategies, the development of vaccines and the appropriate choice of products. Toward the development of a global outbreak surveillance system, the advances below have been made.
7.1 Nextstrain (https://nextstrain.org/)
Nextstrain is a real-time pathogen evolution tracking platform that implements cutting-edge analysis and visualization of pathogen genome data . It provides evolutionary information in the form of interactive visualizations to virologists, epidemiologists, public health officials and citizen scientists. It has been used to track various arboviral epidemics globally including West Nile Virus (WNV) in the Americas, Zika virus in 33 countries and Dengue virus outbreaks in 64 countries. The platform is continually updated with publicly available datasets to provide new insights into viral epidemic outbreaks globally in an intuitive and visually esthetic manner.
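The raw signal behind such evolutionary analyses is the set of differences between aligned genomes. The following Python sketch computes pairwise p-distances (the proportion of differing sites) from an alignment. It is a deliberately simple illustration and not how Nextstrain builds its trees; the function names and the rule of skipping gaps and ambiguous bases are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, Tuple

UNAMBIGUOUS = set("ACGT")


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences,
    ignoring positions where either sequence has a gap or ambiguous base."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x in UNAMBIGUOUS and y in UNAMBIGUOUS:
            compared += 1
            if x != y:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(aligned: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """p-distance for every unordered pair of sequence names (sorted order)."""
    return {
        (first, second): p_distance(aligned[first], aligned[second])
        for first, second in combinations(sorted(aligned), 2)
    }
```

A matrix like this could serve as input to a distance-based tree-building method, with sampling dates and locations layered on top to follow spread through time and space.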
8. Functional prediction tools
In disease surveillance, understanding the effect of the mutations detected in viral genomes through the methods described above is invaluable for the development of relevant controls and interventions. Many of these mutations serve as drug targets and provide insights into the mechanisms by which pathogens respond to existing interventions. A global surveillance system would therefore be incomplete without the capability to provide insights into the function of discovered mutations. Below we explore some of the tools that have been applied to understand the functional relevance of mutations found in arboviruses.
8.1 The SIFT (sorting intolerant from tolerant)
The SIFT algorithm predicts the effect of coding variants on protein function [54, 55]. Since its introduction in 2001, SIFT has become one of the standard tools for characterizing missense variation. It has a corresponding website that provides users with predictions on their variants.
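The intuition behind conservation-based prediction can be sketched in Python. This is a toy illustration and not the SIFT algorithm itself: SIFT derives normalized probabilities from a curated alignment of related proteins, whereas this sketch uses raw residue frequencies in one alignment column. The function names are hypothetical, and the 0.05 cutoff is adopted here as an assumption mirroring the threshold commonly quoted for SIFT scores.

```python
from collections import Counter
from typing import Dict, List


def column_frequencies(alignment: List[str], position: int) -> Dict[str, float]:
    """Relative frequency of each residue at one alignment column, ignoring gaps."""
    column = [seq[position] for seq in alignment if seq[position] != "-"]
    if not column:
        raise ValueError("column contains only gaps")
    counts = Counter(column)
    return {residue: n / len(column) for residue, n in counts.items()}


def tolerance_score(alignment: List[str], position: int, substituted: str) -> float:
    """How often the substituted residue is observed among homologs at this position."""
    return column_frequencies(alignment, position).get(substituted, 0.0)


def predict_effect(alignment: List[str], position: int, substituted: str,
                   cutoff: float = 0.05) -> str:
    """Label a substitution as tolerated if homologs carry it often enough."""
    score = tolerance_score(alignment, position, substituted)
    return "tolerated" if score > cutoff else "affects function"
```

For example, a substitution to a residue never seen at a strictly conserved position scores 0.0 and is flagged, while one that already occurs in a quarter of the homologs is labeled tolerated.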
Augmenting epidemiological data with insights from genomic data provides a powerful tool for the surveillance and control of disease outbreaks. Advances in bioinformatics, in particular, leverage large genomic datasets to determine the pathogenic organisms responsible for an outbreak, the origin of the infection and the mutations responsible for unique phenotypic traits. This information is crucial for effectively planning interventions and combating outbreaks. An area of research interest that remains to be explored is the development of online platforms to perform functional analyses of statistically significant mutations in arboviruses. Such analyses would be invaluable in the development of vaccines and the identification of drug targets.
This work was supported by the ZiBRA2 project, supported by the Brazilian Ministry of Health (SVS-MS) and the Pan American Health Organization (OPAS) and funded by Decit/SCTIE/MoH and CNPq (440685/2016-8 and 440856/2016-7); and by CAPES (88887.130716/2016-2100, 88881.130825/2016-2100 and 88887.130823/2016-2100). MG is supported by Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ).
Conflict of interest
The authors declare no conflict of interest.
Appendices and nomenclature
RT-qPCR: real time quantitative polymerase chain reaction
NGS: next generation sequencing
VIP: virus identification pipeline
RdRp: RNA-dependent RNA polymerase
YFV: yellow fever virus
WNV: West Nile virus
SIFT: sorting intolerant from tolerant