Artificial intelligence (AI) has been termed the machine for the fourth industrial revolution. One of the main challenges in drug discovery and development is the time and costs required to sustain the drug development pipeline. It is estimated to cost over 2.6 billion USD and take over a decade to develop cancer therapeutics. This is primarily due to the high numbers of candidate drugs failing at late drug development stages. Many sizable pharmaceutical and biotech companies have made considerable investments in AI. This is primarily due to recent advancements in AI, which have displayed the possibility of rapid low-cost drug discovery and development. This overview provides a general introduction to AI in drug discovery and development. This chapter will describe the conventional oncology drug discovery pipeline and its associated challenges. Fundamental AI concepts are also introduced, alongside historical and modern advancements within AI and drug discovery and development. Lastly, the future potential and challenges of AI in oncology are discussed.
- drug discovery
- drug development
Artificial intelligence (AI) has been termed the machine for the fourth industrial revolution. AI is anticipated to transform every industry. In drug discovery and development, the key challenges are the time and costs required to sustain the drug development pipeline. It is estimated to cost over 2.6 billion USD and take over a decade to develop an oncology therapeutic. These soaring costs are mostly a result of money invested in the 90% of candidate therapies that fail at the late stages of drug development, between phase 1 trials and regulatory approval. AI is projected to be the foundation for an era of quicker, cheaper, and more efficient drug discovery and development.
Recent advancements in AI are demonstrating the possibility of rapid, low-cost drug discovery and development. The term AI broadly describes the ability of a machine to perform tasks commonly associated with intelligent beings. Machine learning (ML) is a subset of AI in which machines learn from data rather than following explicitly programmed rules. The main difference between the two is that AI names the broad goal, whereas ML is a direct application of it that involves combining and analyzing complex, disparate data sets.
Within the pharmaceutical industry, experts agree that AI will revolutionize how drugs are discovered. There are many components, directly and indirectly related to drug discovery and development, that AI can enhance. These include, but are not limited to, the use of AI in tumour classification, computer-aided organic synthesis, compound discovery, assay development, and biomarker and target discovery [6, 7, 8]. In general, AI aims to automate and optimize slow processes to substantially speed up the R&D drug discovery process.
Several pharmaceutical, biotech, and software companies are also working to integrate AI with drug discovery and development. In 2016, Pfizer partnered with IBM Watson Health, an AI platform, to enhance its search for immuno-oncology treatments. Sanofi paired with Exscientia, a University of Dundee spin-out, to discover metabolic-disease therapies. In 2009, Roche acquired Genentech for $46.8 billion, providing a foundation for Roche’s biotechnology division, which is now integrating AI. Genentech is collaborating with the GNS Healthcare platform to use machine learning to find and validate potential new drug candidates. Recently, Genentech demonstrated the capacity of AI to diagnose diabetic macular degeneration.
Even large traditional tech companies are investing in drug development. Alphabet’s subsidiary DeepMind developed an AI platform, AlphaFold, that predicted protein 3D structures based upon genomic data; its predictions outperformed those of over 90 other entrants, including Novartis and Pfizer, in the 13th Critical Assessment of Structure Prediction. DeepMind’s success with AlphaFold shows how non-healthcare companies can also contribute to and improve the drug discovery and development pipeline. These investments are forming a clear vision that AI will play an important role in future drug discovery and development.
In this overview, we start by introducing the key components of conventional oncology drug discovery and their associated shortfalls. Following this, fundamental AI concepts are introduced, alongside historical and modern advancements within AI and drug discovery and development. Lastly, the future potential and challenges of AI in oncology are discussed.
2. Conventional oncology drug discovery and development
The conventional drug discovery and development pipeline has five key components: target identification, lead discovery, preclinical development, clinical development, and regulatory approval (Figure 1). A drug discovery program initiates after researching the inhibition or activation of a protein or pathway and explaining the potential therapeutic effect. This leads to the selection of a biological target, often requiring extensive validation prior to the lead drug discovery phase. This phase involves the search for a viable drug-like small molecule or biological therapy, termed a development candidate. The drug candidate will progress into preclinical development, and if successful into clinical development.
2.1 Drug discovery and development pipeline: target identification and validation
Biological target identification and validation is a fundamental step in drug discovery. A biological target is a broad term, used to describe a variety of entities including proteins, metabolites, and genes. A biological target must have a clear effect, meet clinical and therapeutic needs, as well as industry needs. Above all, a biological target must be ‘druggable’. The term ‘druggable’ refers to a target that can be bound by a small molecule or larger biologic and elicit a response.
2.1.1 Target identification
A variety of methods exist to identify biological targets. These include gene expression analysis, proteomics and genomics analysis, and phenotypic screening.
The analysis of mRNA/protein expression is often employed to elucidate expression-disease relationships, in which changes in expression levels are correlated with disease exacerbation or progression. At the genetic level, targets are identified by determining whether there is an association between a genetic polymorphism and disease occurrence or progression. For example, one of the most well-studied genetic-disease associations is that of N-acetyltransferase 2 (NAT2) with bladder and colon cancer. N-acetyltransferase 1 (NAT1) and NAT2 encode enzymes that mediate the transformation of two types of carcinogens, aromatic and heterocyclic amines. The NAT2 rapid acetylator phenotype and the slowest NAT2 acetylator phenotype are associated with colon and bladder cancer, respectively [9, 10].
Phenotypic screening is another method for target identification. This can take a variety of forms. Generally, compounds are screened in cellular or animal disease models to identify a compound that leads to the desired change in phenotype. Kurosawa and colleagues screened for overexpressed carcinoma antigens by isolating human monoclonal antibodies that bind to the surface of tumour cells. In this study, clones were screened with immunostaining. Clones that displayed strong staining with the malignant cells were selected. Subsequently, 21 distinct antigens were identified via mass spectrometry. Several immunotherapies may be capable of binding to these 21 antigen targets, possibly leading to a new clinical therapy.
Target identification may involve one or a combination of the previously mentioned methods.
2.1.2 Target validation
While identifying a target typically requires one method, the subsequent target validation requires a variety of methods. A multi-validation approach increases confidence in the success of the biological target and the subsequent drug candidate.
There are a variety of target validation methods that may be implemented, although validation almost always requires target expression in the disease-relevant cells or tissues. A typical primary validation protocol is to measure the expression of protein and/or mRNA in clinical samples, with immunohistochemistry and in situ hybridization.
Generally, in vivo studies are a pivotal factor in the decision to proceed with drug development; these usually involve protein inhibition, gene knockout, or gene knock-in studies. Transgenic animal models are particularly useful as they facilitate phenotypic observations. These animal models often yield insights into potential therapeutic side effects. Transgenic models traditionally involve gene edits whereby an animal lacks or gains a certain gene or genes for its entire life. An example is the P2X7 knockout mouse model, which lacks an inflammatory and neuropathic response. These knockout mice revealed the underlying mechanism of action: their cells did not release the mature pro-inflammatory cytokine IL-1beta, despite IL-1beta expression remaining constant. In contrast to gene knockout models are gene knock-in models. In gene knock-ins, genes not originally in the mouse are inserted, and the corresponding disease protein is subsequently synthesized. These transgenic animals usually have a different phenotype to a knockout and may mimic more closely what happens during disease and treatment.
Another in vivo technique used for target validation is the antisense oligonucleotide-based model. Antisense oligonucleotides are RNA-like molecules that are complementary to the target mRNA molecule. A bound antisense oligonucleotide prevents ribosomal translation of the mRNA to protein. Honore and colleagues created an antisense oligonucleotide that inhibited translation of the rat P2X3 receptor. When rat models were dosed with P2X3 antisense, they displayed anti-hyperalgesic activity. Once administration of the antisense oligonucleotides was discontinued, receptor function and algesic responses returned. Unlike the effect in transgenic models, the antisense oligonucleotide effect is reversible.
While there are many viable target validation methods, two modern technologies can enable tissue-specific validation: clustered regularly interspaced short palindromic repeats (CRISPR) and related techniques, and organs-on-chips.
CRISPR-Cas9 and related approaches provide multiple advancements compared to the transgenic model; these include the ability to overcome embryonic lethality and avoid resistance mechanisms. In brief, CRISPR-Cas9 works by delivering the Cas9 nuclease into the cell. Synthetic guide RNA then guides the nuclease to the desired cut location, facilitating the addition or removal of genes in vivo. An example of CRISPR-Cas9 target validation is the elucidation of the mechanism of action behind compounds that reactivate the tumour suppressor p53. CRISPR-Cas9-based target validation in lung and colorectal cancer showed that the anti-proliferative activity of nutlin is dependent on functional p53. However, using traditional models, the mechanism and therapeutic response to p53-reactivating compounds is lost via compound-specific resistance mechanisms.
Another emerging technology that will facilitate improved target validation is organs-on-chips. These are multi-channel 3-D microfluidic cell culture chips that mimic the functionality and physiology of entire organs. This technology offers the potential to quickly assess the efficacy of, and human response to, target mediation. Song and colleagues used a vasculature system chip model to assess the relationship between vascular endothelium and the metastatic behavior of circulating tumour cells. This study suggested that the inhibition of CXCL12-CXCR4 binding on endothelial cells may be a valid target in the prevention of metastasis. Importantly, organs-on-chips technologies may provide novel insights to target identification and validation studies.
Overall, there are many means to validate targets; all strategies have a common aim: to evaluate the target’s cellular function prior to full investment into the target, and drug candidate screening.
2.2 Drug discovery and development pipeline: lead discovery
Once the biological targets have been identified and validated, the next fundamental step is the lead discovery phase. This comprises three components, in order: hit identification, the hit-to-lead phase, and lead optimization.
2.2.1 Drug discovery and development pipeline: hit identification
It is during this phase that drug compound screening assays are developed and subsequent ‘hit’ compounds derived. The term ‘hit’ compound is defined in a range of ways; in this overview we use it to mean a compound that produces the desired screening effect and has been validated upon retesting. Various screening approaches exist to identify hit molecules. In this overview, we describe the most common screening strategies: high throughput screening, focused screening, and fragment screening.
High throughput screening utilizes an entire compound library and assesses the activity of each compound on the biological target. This typically involves large semi-automated cell-based assays. A candidate hit compound typically requires further assays to confirm its mechanism of action.
Focused screening, also termed knowledge-based screening, selects compounds from a library based on existing information about the target, stemming from literature or patents, which suggests compounds likely to yield the desired target activity.
Fragment screening uses libraries of low-molecular-weight compounds and screens these compounds at high concentrations. Small fragments that bind to the target are often elaborated with chemical alterations to increase their binding affinity.
2.2.2 Hit-to-lead phase and lead optimization
The aim of this intermediate phase is to develop one or more compounds with enhanced properties and with pharmacokinetics suitable for one or many different in vivo models. This step regularly involves a series of structure-activity relationship (SAR) investigations for each hit compound, in an attempt to measure the activity and selectivity of each compound.
The goal of the final lead discovery phase is to obtain compounds with optimal structural, metabolic, and pharmacokinetic properties. This often involves further applications of various in vitro and in vivo screens.
2.3 Drug discovery and development pipeline: preclinical
Once a lead candidate is identified, further elucidation of its structural, metabolic, and pharmacokinetic properties may be required. The typical preclinical development stage comprises several components, typically carried out in animal models: (1) The first preclinical experiments revolve around dose design; a safe dose must be identified using estimated human measurements. (2) Second, the pharmacodynamics of the compound must be characterized; the mechanism of action that causes the clinical response, with respect to dose, must be determined. (3) Third, the pharmacokinetic properties of the drug candidate must be determined. These include absorption, distribution, metabolism, excretion, and potential drug-drug interactions. The aim of preclinical studies is to obtain enough information to determine a safe dose for the first human study. On average, one in 5000 preclinical development candidate drugs makes it through preclinical development and gains regulatory approval.
2.4 Drug discovery and development pipeline: clinical development
The clinical development/clinical trial stage comprises three main stages and one post-market surveillance stage.
The phase 1 clinical studies are carried out in a small number of healthy volunteers. The aim of this stage is to distinguish a therapy’s metabolic and pharmacological effects, as well as the side effect response to varying dosages. The main aim of phase 1 is to determine a therapy’s safety profile.
Stemming from the data collected during phase 1, phase 2 studies, also termed ‘therapeutic exploratory’ trials, involve investigations on several diseased individuals. This phase aims to further determine the effectiveness of the drug with respect to the disease or condition. Side effects and risks are further characterized. Phase 2 studies are controlled and usually conducted on a few hundred patients.
The phase 3 studies are a much larger assessment of the drug’s efficacy and safety, and evaluate the overall benefit-risk relationship of the drug. This phase may also yield enough data to extrapolate the results to the general population, as the studies include several hundred to several thousand people.
Once the drug is approved, there is a fourth phase, known as post-marketing surveillance. These are observational studies, whereby the goal is to define and ensure the safety profile of the drug on a larger population scale.
2.5 Drug discovery and development pipeline: challenges and overview
There are three main reasons why drugs fail: the first is that they simply do not work, the second is that they are unsafe for clinical use, and the third is poor clinical trial structure. The cost of a candidate soars the further it progresses through the drug development pipeline.
The primary source of trial failure is a drug’s lack of efficacy. Hwang and colleagues investigated 640 phase 3 trials, of which 54% failed. Over 50% of these failures were due to a lack of efficacy. There are a variety of reasons why a drug may enter phase 3 trials and yet lack efficacy. These include the propagation of error due to flawed target validation, a poor study design, or simply an insufficient number of trial patients, resulting in weak statistical power and an inability to reject the null hypothesis.
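The link between patient numbers and statistical power can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: a two-arm trial comparing response rates, a two-sided z-test with a normal approximation, and hypothetical response rates of 30% and 40% that do not come from any trial discussed here. The function name `power_two_proportions` is likewise invented for this example.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two response rates."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in response rates (unpooled approximation)
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treatment * (1 - p_treatment) / n_per_arm)
    effect = abs(p_treatment - p_control) / se
    # Probability that the test statistic lands beyond either critical value
    return z.cdf(effect - z_crit) + z.cdf(-effect - z_crit)

# Hypothetical response rates: 30% (control) versus 40% (treatment)
for n in (50, 200, 400):
    print(n, "patients per arm -> power", round(power_two_proportions(0.30, 0.40, n), 2))
```

Under these assumptions the power rises steadily with the number of patients per arm, which is why an underpowered trial can fail to reject the null hypothesis even when the drug has a real effect.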
The infamous poly ADP ribose polymerase (PARP) inhibitor Olaparib failed its first trial for ovarian cancer due to poor trial structure. In the initial trial, in individuals with the BRCA mutation and platinum-sensitive recurrent ovarian cancer, Olaparib extended the time to recurrence from 4.3 months to 11.2 months. However, the median time to death was 34.9 months in the treatment group and 31.9 months in the control group (p = 0.19), a difference that was not statistically significant. In 2014, Olaparib was approved by the FDA for women with recurrent ovarian cancer who have the germline BRCA mutation and had previously received three or more lines of chemotherapy. This approval was based on a study by Kaufman and colleagues, which displayed a response rate > 30% with Olaparib monotherapy in patients who had previously received three or more lines of chemotherapy.
Clinical trials also fail with respect to safety. In Hwang and colleagues’ study, 17% of the initial 640 compounds failed due to safety. Drug safety is a key factor in every stage of candidate drug development; however, problems may only become apparent in larger populations. One reason for safety-related failure is poor reporting of safety concerns. Generally, a patient’s safety concerns may not align with those of the administering physician. It is logical to assume people will be more likely to report an adverse event that is of concern to them. It is important that safety is a primary consideration at each step within the drug development pipeline. The cost of discovering a safety issue grows with progression through each drug development stage.
One of the most impactful drug candidate failures was with sulphanilamide. This drug was popular in the 1930s and sold in both a bolus and elixir form. However, important safety tests had not been conducted for the elixir form, although at the time this testing was not required. Unfortunately, after being treated with the elixir form, over 100 people died due to diethylene glycol poisoning. This led to the implementation of two important acts: The Food, Drug and Cosmetic Act and Drugs and Cosmetics Act.
3. The potential of AI
AI has been utilized in drug discovery since the early 1960s. However, in 2016 many large pharmaceutical companies started investing in AI by partnering with AI startups or academic groups or by initiating their own internal AI R&D programs. This has resulted in an enormous number of new publications within the field that cover the entire drug discovery and development pipeline. These range from the implementation of deep learning models to predict the properties of small molecules from transcriptomics data to the identification of novel drug targets. AI has been integrated into almost every area of drug discovery and development.
The primary aim of drug discovery and development combined with AI remains to facilitate the development of the best drugs and bring them to the clinic to fulfill unmet medical needs.
AI and machine learning have a lot of potential. For those new to the field, the possibilities of AI can seem endless, regardless of the input information. AI has a range of applications. It may succeed at creating an image of a cat from a model trained on images of cats, at enabling a car to drive automatically without making a single mistake, or at designing a drug to treat a disease safely and efficaciously. However, AI will not succeed with every challenge; it is simply a tool that may drive new technologies and enhanced understanding. In drug discovery and development, AI is not one entity that can design a drug from start to finish, but many different AIs that enhance our understanding throughout the drug discovery and development process.
3.1 Fundamental AI concepts
While many computational approaches can fit the broad definition of AI, two fields are currently popular: machine learning and its subfield deep learning. In layman’s terms, the key difference with deep learning is that it uses multiple layers, each employing different calculations on the initial data. In order to understand their capacities, a few fundamental concepts must be understood.
Broadly, there are two different types of machine learning to understand. Supervised learning is when a model is trained using labeled data sets to predict a certain outcome. An example of this is the quantitative structure–activity relationship (QSAR) approach. This is used to predict a chemical’s properties, such as solubility and bioactivity. The other approach is unsupervised learning; as the name suggests, it does not depend upon training with labeled data to find relationships within data. Examples include the use of hierarchical clustering algorithms and principal component analysis to analyze and group large molecular libraries into smaller sub-groups of similar compounds.
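As an illustration of the unsupervised case, the following minimal sketch groups compounds by hierarchical (agglomerative) clustering without using any labels. It assumes single-linkage merging and Euclidean distance, and the six two-dimensional descriptor vectors are hypothetical values made up for this example; a real compound library would use richer descriptors and an established clustering library.

```python
from math import dist

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical, pre-scaled descriptors (e.g. size and lipophilicity) for six compounds
compounds = [(0.10, 0.20), (0.15, 0.22), (0.12, 0.18),
             (0.80, 0.90), (0.85, 0.88), (0.82, 0.95)]
groups = single_linkage(compounds, n_clusters=2)
print(groups)  # lists of compound indices, one list per sub-group
```

No activity labels are supplied at any point; the sub-groups emerge purely from how close the descriptor vectors are to one another.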
With supervised machine learning, there are two types: classification and regression. Classification models are used when the problem is categorical, as in the predicted output is a limited set of values. Regression models are used when the problem involves predicting a numeric value within a range.
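The distinction between the two supervised tasks can be sketched with two deliberately simple models. This is an illustration only: the least-squares line and the nearest-neighbour rule are stand-ins chosen for brevity, and all data values, descriptor meanings, and labels below are hypothetical.

```python
from math import dist
from statistics import mean

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def nearest_neighbour(train_x, train_y, query):
    """Classification: return the label of the closest training example."""
    idx = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[idx]

# Hypothetical data: one descriptor value versus a measured numeric property
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
predicted_value = slope * 5.0 + intercept  # regression output: a number

# Hypothetical data: two descriptors per compound, labelled active or inactive
features = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["inactive", "inactive", "active", "active"]
predicted_label = nearest_neighbour(features, labels, (0.85, 0.75))  # one of a fixed set

print(predicted_value, predicted_label)
```

The regression model returns a continuous value, whereas the classifier can only ever return one of the labels it saw during training.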
There are a variety of different types of machine learning models, such as random forests, autoencoders, and convolutional neural networks. Each of the subsequent chapters will describe specific models as required.
3.2 Examples of AI implementations in drug discovery and development
A vast number of AI and drug discovery papers are published every day, covering various aspects of the entire drug discovery and development pipeline. Drug discovery and development-based AI technologies range from the identification and validation of drug targets to drug repurposing, the identification of new compounds, and improving R&D efficiency. There are a number of potential contributions AI can make to reduce inefficiencies in the conventional drug discovery and development pipeline.
Target identification and validation have been enhanced by AI. This is made possible by genomics, together with biochemical and histopathological information. IBM Watson identified five novel RNA-binding proteins as potential targets linked to the pathogenesis of amyotrophic lateral sclerosis, which currently has no known cure.
One huge opportunity for AI in drug discovery is with drug repurposing. As an example, Donner and colleagues used a transcriptomics data set and derived a new measurement of compound functionality, based on gene expression. This measurement allowed the identification of compounds that shared biological targets, despite being structurally different, revealing previously unknown functional associations between compounds.
An AI platform that can predict a candidate’s mechanism of action and in vivo safety would cut wasted costs dramatically. There are several examples of platforms with this goal. These include DeepTox and PrOCTOR, both of which aim to predict the toxicity of new compounds [32, 33]. The performance of these AI platforms is expected to increase as larger, more robust data sets on the toxicity of compounds are made available.
One important study, published in 2019, reported the discovery of drug candidates within 21 days: deep learning enabled the identification of potent DDR1 kinase inhibitors in that time. Of the four compounds discovered, one lead candidate displayed ideal pharmacokinetics in mice.
Overall, it is clear AI may yield increases in drug discovery efficiency through various strategies.
3.3 Current challenges in AI
AI has shown promise in drug discovery and development. However, it is not without its challenges. AI in medical research faces many challenges, such as a lack of data, a lack of interpretability, and the curse of dimensionality.
The lack of data is a recurring problem throughout every industry wanting to implement AI. A traditional biological study may be considered valid with as few as five samples. However, most machine learning algorithms must be trained on hundreds, or thousands, of data points/samples in order to perform well. Furthermore, obtaining labeled data can be a challenge, as this often requires some form of manual input. Fortunately, large databases, such as The Cancer Genome Atlas program (TCGA), are aggregating and open-sourcing vast amounts of robust data from multiple institutions. However, on some occasions, large databases that include the required data may not exist. One strategy to combat this is data augmentation. Data augmentation is the process of creating artificial data from real data. There are a variety of data augmentation approaches; ultimately they increase the data available for training models without collecting new data.
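One simple form of data augmentation can be sketched as follows. The example assumes numeric feature vectors and adds small Gaussian noise to create extra training samples; the noise level `sigma`, the number of copies, and the two labelled samples are hypothetical choices for illustration, and whether noise of this kind is biologically reasonable depends on the data type.

```python
import random

def augment_with_noise(samples, copies=3, sigma=0.05, seed=0):
    """Create noisy copies of each (features, label) pair; labels are unchanged."""
    rng = random.Random(seed)  # fixed seed so the augmentation is reproducible
    augmented = []
    for features, label in samples:
        for _ in range(copies):
            noisy = [x + rng.gauss(0.0, sigma) for x in features]
            augmented.append((noisy, label))
    return samples + augmented

# Hypothetical normalised expression values for two labelled samples
real = [([0.8, 0.1, 0.5], "tumour"), ([0.2, 0.7, 0.4], "normal")]
training_set = augment_with_noise(real)
print(len(real), "real samples ->", len(training_set), "training samples")
```

The original samples are kept alongside the artificial ones, so the training set grows without any new data being collected.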
Another challenge faced by machine learning is the lack of interpretability. The term ‘black box model’ is often used when it is difficult to explain how a model performs and makes certain predictions. This is more likely to occur with deep learning: each layer adds complexity to the model, so explaining each layer’s outputs can become exponentially more complex as the number of layers increases. However, a variety of tools are being developed to improve explainability, such as LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive exPlanations). LIME adopts a local linear approximation of the model’s behavior, whereas SHAP employs a game theory-based approach to explain the model output. Both LIME and SHAP, and other similar strategies, are projected to become common practice in machine learning and are going to be necessary to get more AI technologies to the clinic.
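The game-theoretic idea behind SHAP can be illustrated by computing exact Shapley values for a toy model. This sketch does not use the SHAP library or its API; it assumes that a "missing" feature is represented by a baseline value, enumerates every feature ordering (feasible only for a handful of features), and uses a hypothetical three-feature model invented for the example.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = instance[i]       # reveal feature i
            value = model(current)
            totals[i] += value - previous  # marginal contribution of feature i
            previous = value
    return [t / len(orderings) for t in totals]

# Hypothetical toy model with an interaction between features 1 and 2
def toy_model(x):
    return 2.0 * x[0] + x[1] * x[2]

contributions = shapley_values(toy_model, instance=[1.0, 2.0, 3.0],
                               baseline=[0.0, 0.0, 0.0])
print(contributions)
```

The contributions sum to the difference between the model output for the instance and for the baseline, and the interaction term is shared equally between the two features involved. Practical tools approximate these values because exhaustive enumeration grows factorially with the number of features.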
A recurring issue with artificial intelligence in medical data is known as the curse of dimensionality. This arises when the data sets used have a small number of samples and many features. This is a common occurrence in medical omics data sets, as they typically yield thousands of features and fewer than 100 samples; thus the available data become sparse. This problem may be addressed with a variety of dimensionality reduction techniques.
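As one example of dimensionality reduction, the sketch below finds the first principal component of a small data set and replaces each sample's features with a single score along that axis. It is a minimal illustration that assumes power iteration is adequate for the toy data; the four samples and three features are hypothetical, and real omics analyses would normally rely on an established numerical library.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return the leading principal axis and each sample's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    axis = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        # Power iteration: apply X^T (X v), proportional to the covariance matrix
        scores = [sum(c[j] * axis[j] for j in range(d)) for c in centred]
        new_axis = [sum(scores[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = sqrt(sum(v * v for v in new_axis))
        axis = [v / norm for v in new_axis]
    scores = [sum(c[j] * axis[j] for j in range(d)) for c in centred]
    return axis, scores

# Hypothetical data: four samples, three features; the third feature never varies
data = [[1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0], [4.0, 4.0, 5.0]]
axis, scores = first_principal_component(data)
print(axis)    # direction of greatest variance
print(scores)  # one number per sample instead of three features
```

Here the two correlated features collapse into one axis and the uninformative third feature receives zero weight, so each sample is described by a single score.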
Overall, there are a series of challenges that will need to be addressed for AI to reach its optimal capacity. In this passage, we have only described a few challenges. However, they are being addressed with advancements in complementary data science approaches and tools, such as the creation of large data repositories, tools to increase explainability, and the creation of feature reduction techniques.
4. Concluding remarks
Taking a drug from idea to the clinic is a long, diverse process; developing a cancer therapeutic costs over 2.6 billion dollars and takes over a decade. This is primarily due to high numbers of candidate drugs failing at late drug development stages. Advancements in AI are continually demonstrating the possibility of rapid, low-cost drug discovery and development. As we make our way through the 2020s, it is evident that drug discovery and development will be permanently shaped by AI.
The author would like to thank N.B.N and I.H.
Conflict of interest
The author declares no conflict of interest.
Acronyms and abbreviations
|CRISPR|clustered regularly interspaced short palindromic repeats|
|PARP|poly ADP ribose polymerase|
|TCGA|The Cancer Genome Atlas|