With the advent of dynamic omics technologies, especially transcriptomics and proteomics, a huge amount of data on diseases and approved drugs has become available through multiple global projects and individual studies. These omics data, together with new machine learning technology, greatly promote the translation of drug research into clinical trials. This chapter covers the following topics: 1) an introduction to the basic principles of gene signature-based drug repurposing; 2) databases of genes, drugs and diseases; 3) gene signature databases of approved drugs; 4) gene signature databases of various diseases; 5) gene signature-based methods and tools for drug repositioning; 6) new omics technology for drug repositioning; and 7) drug repositioning examples with reproducible code. Finally, we discuss future trends and conclude.
- drug repurposing
- mode of action
- reproducible study
1. Introduction
Drug repositioning identifies new indications for approved drugs. Compared with traditional drug development, it carries lower risk, requires fewer human resources, costs less and has a shorter development period. Sir James Black, a Nobel Prize laureate, stated that “The most fruitful basis for the discovery of a new drug is to start with an old drug”, which greatly promoted the concept of drug repositioning. There are numerous examples of drug repositioning, as described in the book. Multinational pharmaceutical companies, such as AstraZeneca and GSK, have also shown great interest in drug repurposing approaches [2, 3].
In this chapter, we focus on gene signature-based drug repositioning. The idea dates back to the year 2000, when Hughes et al. built a prototypical library of microarray-based gene expression signatures in yeast, covering about 300 diverse gene mutations and treatments with 13 drugs with known molecular targets, while keeping other experimental conditions consistent. They identified a new target of the drug dyclonine by comparing the signatures of genes and drugs via pattern matching. This article opened the door to gene signature-based drug repositioning.
A comprehensive gene signature library of genes, diseases and perturbations plays a fundamental role in gene signature-based drug repositioning. From the gene perspective, advances in molecular biology, especially the emergence of RNAi and CRISPR/Cas9 technology, make it possible to knock down, knock out or knock in genes and to measure the resulting expression signatures.
From the disease perspective, modeling a disease in a cell or animal experimental assay makes it possible to produce the gene signatures of various diseases by quantifying molecular phenotypes. It should be noted that modeling various diseases in a parallel, high-throughput way remains relatively difficult, because the modeling conditions are disease-specific or unclear owing to the complexity of, and our limited understanding of, some diseases. However, as understanding of the pathogenesis of various diseases improves, it will become more efficient to build cellular and animal models of diseases by genome editing with CRISPR/Cas9 technology.
Finally, from the drug perspective, thousands of approved drugs are available so far. Many bioactive compounds, besides the approved drugs, have also been profiled to obtain their gene signatures. In particular, the Connectivity Map (CMap) and the Library of Integrated Network-based Cellular Signatures (LINCS) program [9, 10] greatly accelerated the development of drug repositioning, as they made a huge number of gene signatures of drugs and compounds freely available to the scientific community.
The core principle of gene signature-based drug repositioning is that a candidate drug should reverse the gene signature of the disease of interest, that is, the expression changes observed in the disease compared with controls (Figure 1). The reversal can be characterized by anti-correlation, distance, similarity or metrics produced by machine learning models. A derivative principle is that similarity between two drugs can reveal shared indications. Specifically, if drug A can be used to treat disease C, and drug B is similar to drug A based on their gene signatures, then drug B may also be useful for treating disease C. This idea likely comes from chemoinformatics, where the principle that drugs with similar chemical structures should have similar functions is widely used in drug research and development, especially in the development of me-too drugs. Importantly, several researchers have developed or refined this principle from different perspectives, making the idea efficient to implement and use.
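The reversal principle can be illustrated with a small sketch. The Python below is only a minimal example under stated assumptions: signatures are represented as gene-to-log-fold-change dictionaries, reversal is measured by Spearman anti-correlation (one of several possible metrics), and all gene names, drug names and values are invented for illustration.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rank_reversers(disease, drugs):
    """Sort drugs so that the strongest reversers (most negative rho) come first."""
    scores = []
    for name, signature in drugs.items():
        genes = sorted(set(disease) & set(signature))  # shared genes only
        if len(genes) < 3:
            continue  # too few shared genes for a meaningful correlation
        rho = spearman([disease[g] for g in genes], [signature[g] for g in genes])
        scores.append((name, rho))
    return sorted(scores, key=lambda item: item[1])

# Hypothetical toy signatures (log fold changes versus controls).
disease_sig = {"G1": 2.0, "G2": 1.5, "G3": -1.0, "G4": -2.5, "G5": 0.3}
drug_sigs = {
    "drug_reverser": {"G1": -1.8, "G2": -1.0, "G3": 0.9, "G4": 2.2, "G5": -0.1},
    "drug_mimic": {"G1": 1.9, "G2": 1.1, "G3": -0.7, "G4": -2.0, "G5": 0.2},
    "drug_unrelated": {"G1": 0.1, "G2": -0.4, "G3": 0.5, "G4": 0.0, "G5": 0.3},
}
ranking = rank_reversers(disease_sig, drug_sigs)
for name, rho in ranking:
    print(f"{name}\t{rho:+.2f}")
```

In this toy setting, the drug whose signature mirrors the disease signature is ranked first and the drug that mimics the disease is ranked last. Real analyses would work on thousands of genes and would usually add a significance estimate.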
Gene signatures are molecular phenotypes that reveal the molecular landscape of genes, diseases or drugs. In general, gene signatures are the expression profiles, or expression changes, of RNA measured by transcriptome profiling via microarray, next-generation sequencing or third-generation sequencing [5, 8]. More broadly, a gene signature can be the abundance profiles, or abundance changes, of proteins quantified by antibody-based or tandem mass spectrometry (MS/MS)-based proteomics. The reason is that the principle of gene signature-based drug repositioning applies to any molecular phenotype, such as the transcriptome or proteome. Moreover, from the viewpoint of machine learning models, transcriptome and proteome data are largely alike: both are tabular data in which molecular measurements serve as features of samples.
In summary, the rapid advance of omics technologies, the huge amount of publicly available omics data related to molecules, drugs, diseases and genes, growing computational resources and efficient deep learning algorithms together make the field of drug repositioning vigorous, and the therapeutic applications of drug repositioning will keep increasing. In the following sections, we first introduce the databases related to genes, pathways, drugs and diseases, which provide the resources for gene signature-based drug repositioning. We then describe key tools and web servers for drug repositioning, with a highlight on new, powerful and easy-to-use methods, and show examples of drug repositioning for several diseases with reproducible code that readers can follow. Finally, we summarize the ongoing challenges, unmet needs and future trends, and conclude.
2. Databases of genes, pathways and drugs for drug repositioning
Genes play a critical role in gene signature-based drug repositioning. In particular, drug targets are important in traditional drug development. In general, drug targets are human or viral proteins that are druggable and associated with one or more diseases. So far, about 900 biomolecules are targeted by about 1500 US FDA-approved drugs, as curated by Santos et al. Having this information facilitates gene signature-based drug repositioning. Several databases and web servers provide gene information that is useful in drug development.
GeneCards (https://www.genecards.org/) is an integrative knowledge base and web server with comprehensive information on all human genes, integrating more than 150 high-quality web sources and ranging from genotype to phenotype and functional information. Although it is a general database that is not centered on drug development, it provides comprehensive knowledge about a gene of interest. Browsing this website is highly recommended at the beginning of a study of a target.
DGIdb (drug-gene interaction database, www.dgidb.org) is a web server with information on drug-gene interactions and druggable genes, collected from more than thirty high-quality web sources. Once biomarkers or therapeutic targets are identified, researchers can use DGIdb to search for drugs that target them, providing a quick translational opportunity.
The Open Targets database (https://www.opentargets.org/) aims to identify and prioritize promising therapeutic targets by analyzing human genetics, genomics and functional genomics data [17, 18]. The database emphasizes the genetics of diseases, using genome-wide association studies to support causal inference for genes, which is beneficial to drug development [19, 20].
The Clue.io web server (https://clue.io/) hosts the updated CMap LINCS gene expression resource, including signatures of cells perturbed by gene over-expression, RNAi gene knockdown and CRISPR gene knockout, the latter generating loss-of-function mutants [9, 21]. It holds abundant gene perturbation data, providing a great resource for studying the effect of perturbing a target and thereby mimicking the effect of a drug on that target [22, 23, 24]. It also supplies the Drug Repurposing Hub, a curated library of drugs with a companion knowledge resource.
Beyond the gene level, pathways can also be a key resource in drug repositioning. A pathway consists of a set of genes and can come from the Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), the Reactome Pathway Database (https://reactome.org/) or other gene set collections. Because the genes in a pathway are selected for a shared function rather than at random, the pathway concept generalizes naturally to the gene set, which substantially broadens the functional aspects that can be covered. A good resource of gene sets is the Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb/), which supplies downloadable gene sets in GMT format, facilitating their use in bioinformatic analysis. Pathways are important for several reasons. Firstly, they can illuminate the mode of action of drugs by connecting genes and drugs. Secondly, a pathway can serve as a feature that summarizes the gene signature at a higher level, which is useful in machine learning-based modeling; it captures information about drugs or diseases that differs from the gene level [28, 29, 30]. Thirdly, pathway analysis can increase confidence in the predicted candidate drugs.
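As a small illustration of working with gene sets, the sketch below reads the tab-separated GMT layout (gene set name, description, then member genes on one line) and counts how many signature genes fall into each set. The pathway names, description fields and the helper names parse_gmt and pathway_hits are hypothetical and used for illustration only; the gene symbols serve merely as placeholders.

```python
import io

def parse_gmt(handle):
    """Parse GMT lines into a dict mapping gene set name -> set of genes."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets

def pathway_hits(signature_genes, gene_sets):
    """Count how many signature genes belong to each gene set."""
    signature = set(signature_genes)
    return {name: len(members & signature) for name, members in gene_sets.items()}

# A two-line toy GMT file held in memory instead of a downloaded file.
example = io.StringIO(
    "TOY_PATHWAY_A\thttp://example.org/a\tTP53\tMDM2\tCDKN1A\n"
    "TOY_PATHWAY_B\tna\tIL6\tSTAT3\n"
)
gene_sets = parse_gmt(example)
hits = pathway_hits({"TP53", "IL6", "STAT3"}, gene_sets)
print(hits)
```

For a real file, the same function can be applied to an opened file handle, for example `with open(path) as handle: gene_sets = parse_gmt(handle)`.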
Information about drugs is an invaluable resource for drug repositioning and also serves as an evaluation dataset. The repoDB database, a standard dataset for benchmarking computational repositioning methods, consists of 6677 approved and 4123 failed drug-indication pairs. The Experimental Knowledge-Based Drug Repositioning Database (EK-DRD, http://www.idruglab.com/drd/index.php) curated 1861 FDA-approved and 102 withdrawn drugs with validated drug repositioning annotations. These datasets facilitate the training and testing of machine learning-based models.
3. Gene signature databases related to drugs
The gene signature databases of drugs and compounds are fundamental resources that determine the search space for drug repositioning. Researchers have long sought to enlarge the gene signature library of drugs and compounds. For example, they have profiled many bioactive compounds and ligands, such as growth factors and cytokines, which are not drugs but have known functions [8, 9, 10]. The data resources related to drugs come mainly from two sources. One is public data scattered across repositories such as GEO, where manual curation by professional researchers is necessary to produce a usable dataset for drug repositioning; there is a trend toward advanced metadata curation of GEO. The other is large projects, such as CMap, that aim to create a reference dataset of gene signatures for drug development.
NCBI GEO, EMBL-EBI ArrayExpress and NGDC Gene Expression Nebulas store massive omics data, including many transcriptome datasets of drugs and other compounds. However, researchers need to search, collect and clean these data before using them for drug repositioning. Fortunately, several groups have collected many gene expression signatures related to drugs.
CREEDS (CRowd Extracted Expression of Differential Signatures) extracted and analyzed the signatures of 875 drugs and 828 diseases from GEO via a crowdsourcing project run as part of a massive open online course on Coursera. The dataset can be downloaded from the website, https://maayanlab.cloud/CREEDS/.
HERB (http://herb.ac.cn) is a high-throughput experiment database of traditional Chinese medicine, consisting of 7263 herbs and 49,258 ingredients, from 472 high-throughput GEO datasets, providing complementary and valuable drug resources .
The CMap version 1 (https://portals.broadinstitute.org/cmap/) consists of 6100 Affymetrix-based gene signatures of 1309 compounds perturbing five different cell lines (such as PC3, MCF7 and HL60) at varying doses (mainly 10 μM). Notably, there were 164 distinct perturbagens, including approved drugs and nondrug bioactive compounds, in the original article published in the
The CMap version 2 (https://clue.io/cmap), which belongs to the NIH Library of Integrated Network-Based Cellular Signatures (LINCS) program, includes 1.3 million L1000 profiles and 25,200 unique perturbations across various cell lines. Its developers chose the L1000 technology for cost reasons and argued, based on a comprehensive comparison, that about 1000 landmark genes can recover 82% of the information in the full transcriptome. As expected, the updated dataset has also motivated the continued development of drug repositioning. It should be noted that the consistency between the two versions of CMap is not high, with a low recall. This suggests that drug repositioning based on CMap should draw on other evidence to filter out false positives.
In summary, the availability of a huge number of drug gene signatures provides the big-data basis that makes gene signature-based drug repositioning possible. Meanwhile, researchers are still developing new transcriptome technologies to make large-scale transcriptome profiling of millions of drugs, across different cell lines and doses, possible at a relatively low cost. In addition, as the cost of conventional RNA-Seq falls, it may soon become feasible to use RNA-Seq directly.
4. Gene signature databases of various diseases
The gene signature databases of various diseases are a complementary resource for drug repositioning. Importantly, the gene signatures of diseases are, to some extent, robust across different tissues and experiments (Dudley et al. 2009). As mentioned in the introduction, it is difficult to model various diseases in a parallel, high-throughput way. Researchers have nevertheless collected gene signature datasets for numerous diseases. In practice, however, biologists usually focus on a specific disease and can therefore generate the gene signature of that disease themselves. Once they have the disease signature, they can directly query the gene signature library of drugs to obtain candidate drugs for the disease.
The gene signatures of diseases were mainly collected from GEO. ADEPTUS (Annotated Disease Expression Profiles Transformed into a Unified Suite) supplied about 14,000 ready-to-use gene signature profiles annotated with Disease Ontology terms. ADEPTUS established a classic procedure for deriving gene signatures of various diseases. The STARGEO (Search Tag Analyze Resource for GEO) project used a crowdsourcing approach to annotate disease-related samples in GEO, so that robust disease signatures can be identified by meta-analysis. It covered about 250 types of diseases and can be extended via the web server. The DrugVsDiseasedata (Drug versus Disease data) package defined 45 gene signatures of diseases collected from GEO, such as breast, small-cell lung, cervical, bladder and prostate cancer. Recently, Porcu et al. used Mendelian randomization to show that differentially expressed genes reflect disease-induced rather than disease-causing changes in the transcriptome. Thus, identifying the upstream genes that cause diseases is a promising direction in the analysis of disease transcriptome data.
Although there are several gene signature datasets of diseases, more effort is needed to enlarge the range of diseases covered. The Disease Ontology is a useful reference when searching for a disease. As the number of disease gene signatures increases, the search space of the algorithms expands, and so does the chance of connecting drugs and diseases.
5. Gene signature-based methods and tools for drug repositioning
Once the gene signatures of drugs and diseases, as well as other useful information (such as the structures of drugs), are ready, we can carry out a computational drug repositioning analysis. Ultimately, the task is to find a method that connects the drug and the disease. This connecting method can be a similarity metric, community discovery, matrix factorization and completion, a machine learning-based model and so on. A good method should significantly enrich true-positive results and deplete false-positive results.
There are several biologist-friendly web servers that require no programming. The CMap version 1 website is one of the most popular websites in the field of drug repositioning, and the CMap version 2 website offers richer functionality. The Enrichr website (https://maayanlab.cloud/Enrichr/) also provides a drug repositioning module with drug and disease libraries (for example, the Drug_Perturbations_from_GEO_down gene set) [45, 46]. Biologists can easily use these websites for drug repositioning.
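One common way to quantify whether a method enriches true positives is the area under the ROC curve (AUC), computed against known drug-indication pairs. The following is a minimal sketch, assuming that a higher score means stronger predicted association (reversal scores, where negative is better, would need to be negated first); the scores and labels are invented for illustration.

```python
def auc(scores, labels):
    """Probability that a random true pair outscores a random false pair.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, is_true in zip(scores, labels) if is_true]
    negatives = [s for s, is_true in zip(scores, labels) if not is_true]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions for five drug-disease pairs.
predicted_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
known_indication = [True, False, True, False, False]
result = auc(predicted_scores, known_indication)
print(round(result, 3))
```

An AUC of 0.5 corresponds to random ranking, while values close to 1 indicate that known drug-indication pairs are ranked ahead of the others.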
The nonparametric Kolmogorov–Smirnov statistic, formalized in Gene Set Enrichment Analysis (GSEA), was used in the original CMap article, indicating its power [8, 47]. It tests whether the empirical distribution of data (a set of genes) differs from a reference distribution (such as a ranked gene list related to a drug). Being nonparametric, the test simplifies the statistical procedure and is applicable to many situations.
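The running-sum form of the Kolmogorov-Smirnov statistic can be sketched in a few lines. The Python below is a minimal, unweighted illustration and not the exact formula of the original CMap article: it assumes a drug is described by a gene list ranked from most up-regulated to most down-regulated, and it combines the enrichment of the disease up and down gene sets into a single signed score. Function names and gene labels are hypothetical.

```python
def ks_enrichment(ranked_genes, gene_set):
    """Signed maximum deviation of an unweighted running sum along the ranking."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up at genes in the set, step down at all other genes.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def connectivity(ranked_genes, disease_up, disease_down):
    """Combine up and down enrichment; negative values suggest reversal."""
    es_up = ks_enrichment(ranked_genes, disease_up)
    es_down = ks_enrichment(ranked_genes, disease_down)
    if es_up * es_down > 0:
        return 0.0  # both sets enriched at the same end: no clear direction
    return es_up - es_down

# Toy drug ranking: "A" is the most up-regulated gene, "H" the most down-regulated.
drug_ranking = ["A", "B", "C", "D", "E", "F", "G", "H"]
reversal_score = connectivity(drug_ranking, {"G", "H"}, {"A", "B"})
mimic_score = connectivity(drug_ranking, {"A", "B"}, {"G", "H"})
print(reversal_score, mimic_score)
```

Here the disease up-regulated genes sit at the bottom of the drug ranking and the down-regulated genes at the top, so the first score is strongly negative. In practice, significance is usually assessed by permutation.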
PAGE (parametric analysis of gene set enrichment), which can be used to evaluate the similarity between two gene expression signatures, was more sensitive and less computationally demanding than GSEA. Dr. Insight used the concordantly expressed genes in a frame-breaking statistical model to connect the drug and disease. The eXtreme Sum (XSum), a similarity scoring algorithm developed by Cheng et al., outperformed the KS statistic in terms of the area under the curve on 890 drug-indication pairs with 496 compounds and 238 disease signatures.
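To make the parametric idea behind PAGE concrete, the sketch below computes a z-score for a gene set from the mean of its values relative to the mean and standard deviation of the whole profile, Z = (S_m − μ)·√m / σ, where m is the gene set size. This is a minimal illustration with an invented profile; the use of the population standard deviation is an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, pstdev

def page_z(profile, gene_set):
    """PAGE-style z-score of a gene set within a gene -> fold-change profile."""
    all_values = list(profile.values())
    members = [value for gene, value in profile.items() if gene in gene_set]
    m = len(members)
    mu = mean(all_values)
    sigma = pstdev(all_values)  # population SD of the whole profile (assumption)
    if m == 0 or sigma == 0:
        return 0.0
    return (mean(members) - mu) * sqrt(m) / sigma

# Hypothetical drug profile (log fold changes) and two disease gene sets.
drug_profile = {"A": 2.0, "B": 1.0, "C": 0.0, "D": -1.0, "E": -2.0}
z_up = page_z(drug_profile, {"A", "B"})
z_down = page_z(drug_profile, {"D", "E"})
print(z_up, z_down)
```

Under the normal approximation the z-score can be converted directly to a p-value, which avoids the permutations required by rank-based approaches.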
Network-based community discovery can exploit the similarity between the gene expression signatures of drugs to identify similar drugs, which cluster together in the network. This approach was implemented in MANTRA (Mode of Action by NeTwoRk Analysis), an accessible and biologist-friendly tool at http://mantra.tigem.it. GPSnet (Genome-wide Positioning Systems network) associated drugs with gene signature-based disease modules in the protein–protein interactome network. DeMAND (Detecting Mechanism of Action by Network Dysregulation) is a regulatory network-based approach that elucidates the mode of action (MoA) using gene expression signatures. Chemical Checker integrated five levels of drug data, such as targets, morphology and gene expression signatures, to evaluate the similarity of drugs via dimensionality reduction and network embedding algorithms.
Cogena (co-expressed gene-set enrichment analysis) focuses on targeting co-expressed genes instead of all differentially expressed genes for drug repositioning. It enables simultaneous, gene set knowledgebase-driven drug repositioning analysis and illustrates the mode of action of the predicted drug-disease pairs. Cogena has been widely used in drug repositioning for several diseases, including psoriasis, Coronavirus Disease 2019 (COVID-19) [56, 57], Crohn’s disease and periodontitis.
Machine learning, especially deep learning, is inherently well suited to gene expression signatures. Low-rank matrix approximation and randomized algorithms were used in drug repositioning to fill in the unknown connections among drug-disease pairs. iDrug repositions drugs via cross-network embedding, transferring knowledge from drug target information. DLEPS (deep learning-based efficacy prediction system) used one-dimensional convolutional neural networks to learn the relationship between the structures of drugs and gene expression signatures in order to predict drug efficacy. Clearly, with the advances in deep learning, especially graph neural networks, many innovative algorithms will continue to be applied in the drug repositioning field to improve performance.
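The idea of filling in unknown drug-disease connections with a low-rank model can be illustrated in its simplest form, a rank-1 factorization fitted by alternating least squares. The sketch below is not the algorithm of any method cited above; the association matrix, its values and the function name rank1_complete are hypothetical.

```python
def rank1_complete(matrix, iterations=100):
    """Approximate observed entries by u[i] * v[j]; None marks unknown entries."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows  # one latent factor per drug (row)
    v = [1.0] * n_cols  # one latent factor per disease (column)
    for _ in range(iterations):
        # Update each row factor by least squares over its observed entries.
        for i in range(n_rows):
            observed = [j for j in range(n_cols) if matrix[i][j] is not None]
            den = sum(v[j] ** 2 for j in observed)
            if den > 0:
                u[i] = sum(matrix[i][j] * v[j] for j in observed) / den
        # Update each column factor in the same way.
        for j in range(n_cols):
            observed = [i for i in range(n_rows) if matrix[i][j] is not None]
            den = sum(u[i] ** 2 for i in observed)
            if den > 0:
                v[j] = sum(matrix[i][j] * u[i] for i in observed) / den
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]

# Toy association scores: 3 drugs x 2 diseases, one unknown pair.
scores = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, None],
]
completed = rank1_complete(scores)
print(round(completed[2][1], 3))
```

Because the observed entries in this toy matrix are exactly rank 1, the unknown entry is recovered as 6. Real drug-disease matrices are noisy and sparse, so higher ranks, regularization and held-out validation would be needed.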
6. New high-throughput technology for drug repositioning
Researchers continue to develop new high-throughput transcriptome technologies that improve precision within cost constraints. For example, microarrays were used in the first version of CMap, while the L1000 technology was used in the second version, that is, LINCS, which is a more than 1000-fold scale-up of CMap. Using Luminex bead-based probe hybridization, L1000 measures the mRNA abundance of only 978 “landmark” genes, and the expression of the remaining genes is inferred by a machine learning algorithm. This design was chosen largely to lower the cost of profiling the transcriptomes of a huge number of drugs and compounds.
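The inference of non-measured genes from landmark genes can be pictured as a regression problem. The sketch below fits an ordinary least-squares model for one hypothetical target gene from two hypothetical landmark genes using reference samples where both are measured; it is a simplified illustration rather than the actual L1000 inference model, and all values are invented.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; landmark genes may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_inference_model(landmarks, target):
    """Least-squares weights [intercept, w1, w2, ...] via the normal equations."""
    rows = [[1.0] + list(sample) for sample in landmarks]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def infer(weights, landmark_values):
    """Predict the target gene from measured landmark values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], landmark_values))

# Reference samples: (landmark_1, landmark_2) and the measured target gene.
reference_landmarks = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 5.0)]
reference_target = [3.0, 0.0, 2.0, 4.0, 2.0]  # equals 1 + 2*l1 - 1*l2 exactly
weights = fit_inference_model(reference_landmarks, reference_target)
predicted = infer(weights, (2.0, 2.0))
print([round(w, 3) for w in weights], round(predicted, 3))
```

With hundreds of landmark genes, the same idea is applied once per inferred gene, which is why a well-chosen landmark set can stand in for much of the transcriptome.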
RNA-Seq via next-generation sequencing is a relatively new technology in the drug repositioning field. Because of its higher cost, researchers have tried several ways to lower the cost while maintaining transcriptome performance. For example, a subset of genes that gives a reduced representation of the transcriptome can be sequenced instead of all the mRNA. The L1000 technology used the most informative genes, named “landmark” genes. Mav et al. argued that a knowledge-driven subset of 1500 sentinel genes could precisely predict pathway perturbations. RASL-seq (RNA-mediated oligonucleotide annealing, selection, and ligation) measured only hundreds of pre-defined genes in response to a set of 350 chemicals and their mixtures, providing a cost-effective approach to quantifying gene expression signatures with a panel of marker genes. TempO-Seq (Templated Oligo assay with Sequencing readout) can profile the whole transcriptome in a targeted way, requiring less sequencing depth.
Pooled, low-depth next-generation sequencing is another approach to lowering the cost while maintaining performance. PLATE-seq (pooled library amplification for transcriptome expression) introduced sample-specific barcodes, allowing pooled library construction in 96 wells and low-depth sequencing, which is about 15-fold less expensive than canonical RNA-Seq. DRUG-seq efficiently captured transcriptional changes with low-depth reads by incorporating barcodes and a Unique Molecular Index (UMI) in 384- and 1536-well formats, with fewer steps than PLATE-seq. Notably, DRUG-seq also recently supplied an open-source R analysis pipeline on GitHub. BRB-seq (Bulk RNA Barcoding and sequencing) used early-stage multiplexing to produce 3′ cDNA libraries for multiple samples at a lower cost. 3’Pool-seq is an optimized, cost-efficient method of transcriptome profiling that was also adapted to a 96-well plate format with ERCC spike-ins. Collectively, researchers have developed multiple new transcriptome technologies that lower the cost of sequencing and make RNA-Seq feasible for large numbers of samples, which arise from different doses, treatments and treatment durations.
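The role of sample barcodes and UMIs in pooled protocols can be shown with a short counting sketch. It assumes that each read has already been assigned a well barcode, a UMI and a gene, and it collapses duplicate reads so that each unique molecule is counted once. The tuple layout, barcode strings and function name are assumptions for illustration and are not taken from any specific pipeline.

```python
from collections import defaultdict

def umi_count_matrix(reads):
    """Count unique (well, UMI, gene) molecules per well and gene.

    `reads` is an iterable of (well_barcode, umi, gene) tuples; PCR duplicates
    share all three fields and are therefore counted only once.
    """
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for well, umi, gene in reads:
        key = (well, umi, gene)
        if key in seen:
            continue  # duplicate of a molecule already counted
        seen.add(key)
        counts[well][gene] += 1
    return {well: dict(genes) for well, genes in counts.items()}

# Hypothetical demultiplexed reads from a pooled library.
reads = [
    ("W1", "AAAC", "GAPDH"),
    ("W1", "AAAC", "GAPDH"),  # PCR duplicate of the first read
    ("W1", "AAGT", "GAPDH"),
    ("W2", "AAAC", "GAPDH"),  # same UMI but a different well
    ("W1", "CCGT", "ACTB"),
]
matrix = umi_count_matrix(reads)
print(matrix)
```

Production pipelines additionally correct sequencing errors in barcodes and UMIs, which this sketch deliberately omits.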
Other types of gene signatures, such as the proteome and metabolome, can also be used in drug repositioning. Zhao et al. created a systematic map of protein-drug connectivity that compiled 210 clinically relevant protein signatures, based on antibody-based proteomics technology, in more than 12,000 cell-line samples in response to about 150 drugs. ProTargetMiner is a proteome signature library of 56 molecules in A549 cancer cell lines, forming a valuable tool in drug discovery. Ruprecht et al. profiled the proteomes of five lung cancer cell lines (such as A549, Calu6 and Calu1) perturbed by more than 50 drugs using a label-free proteomics platform. Moreover, an atlas (http://bbmri.researchlumc.nl/atlas/) of associations between 87 drugs and 150 clinically relevant plasma-based metabolites will contribute to drug development as well. Omics data beyond the transcriptome that relate to drugs and diseases will help drug repositioning flourish. In summary, new omics technologies will precisely quantify the signatures related to drugs and diseases at a low cost, permitting large-scale omics projects and enlarging the search library for drug repositioning.
7. Drug repositioning examples with reproducible code
Given the COVID-19 pandemic and the lack of effective drugs for this disease, drug repositioning is a promising way to combat it. Several researchers have used cogena for drug repositioning against COVID-19 [56, 74].
We used metatranscriptome data of bronchoalveolar lavage fluid from 8 severe COVID-19 patients and 20 healthy controls to obtain the gene expression signature of COVID-19. The co-expression analysis, pathway analysis and drug repositioning analysis were done using the cogena pipeline. We identified several drugs whose association with COVID-19 had been reported before. For example, saquinavir, a protease inhibitor, is a drug for human immunodeficiency virus infection; it was also identified by several docking methods. Dexamethasone was described as a “major development” in the fight against COVID-19 following the RECOVERY trial. Ribavirin can be used to treat SARS-CoV and MERS-CoV infections. Importantly, it is a recommended drug in the diagnosis and treatment protocol for COVID pneumonia (trial version 5–latest) published by the National Health Commission of the P.R. of China, and it was also identified by several docking methods. Furthermore, we identified several other candidate drugs for COVID-19, for example, dinoprost, a smooth muscle activator, and (−)-isoprenaline, a bronchodilator for obstructive lung diseases. These candidate drugs could be tested in vitro and in vivo to validate their potential.
The whole pipeline of this gene signature-based drug repositioning for COVID-19 using cogena is accessible, with data and code, at https://github.com/zhilongjia/COVID-19, forming a good resource for drug repositioning and reproducible study.
There are also other examples of drug repositioning using cogena with reproducible code. For instance, the code for drug repositioning in psoriasis is available at https://github.com/zhilongjia/psoriasis, and the code for drug repositioning in periodontitis is available at https://github.com/zhilongjia/Fn_HGFcell. These examples illustrate how drug repositioning works and how to implement it.
8. Future perspectives and conclusion
The future of gene signature-based drug repositioning is bright. The booming biotechnology and pharmaceutical industry, especially the emerging sequencing and MS fields, supplies an important motivation to generate more omics data related to drugs and diseases. The artificial intelligence industry, particularly deep learning algorithms, will also promote the rapid development of the drug repositioning field by increasing the rate of true positives and lowering the rate of false positives. The omics data of drugs and diseases are like electricity, while the algorithm is like a machine; their seamless combination will produce new opportunities for gene signature-based drug repositioning. More data means a larger search space in which to identify new relationships between drugs and diseases. Additionally, signature-based drug combinations could also be investigated to deal with intractable diseases. Meanwhile, more evidence on different aspects of drug-disease pairs will improve the quality of prediction.
In the end, we highlight the key points of this chapter.
A systematic introduction to gene signature-based drug repositioning and its core principle;
Gene signatures can be derived from molecular phenotypes, such as the transcriptome and proteome;
Basic databases of genes, pathways and drugs for drug repositioning;
Gene signature databases of drugs and diseases;
Gene signature-based methods and tools for drug repositioning;
New high-throughput technology for drug repositioning;
Drug repositioning examples with reproducible code;
The future direction of gene signature-based drug repositioning.
This work was supported by the National Natural Science Foundation of China [grant number 31701155].
Conflict of interest
The authors declare no conflict of interest.