Computational analysis of genomic data has transformed research and clinical practice in oncology. Machine learning and AI advancements hold promise for answering theoretical and practical questions. While the modern researcher has access to a catalogue of tools from disciplines such as natural language processing and image recognition, before browsing for our favourite off-the-shelf technique it is worth asking a sequence of questions. What sort of data are we dealing with in cancer genomics? Do we have enough of it to be successful without designing into our models what we already know about its structure? If our methods do work, will we understand why? Are our tools robust enough to be applied in clinical practice? If so, are the technologies upon which they rely economically viable? While we will not answer all of these questions, we will provide language with which to discuss them. Understanding how much information we can expect to extract from data is a statistical question.
- high-dimensional statistics
- cancer genomics
- learning theory
This chapter should be equally approachable to those with a background in machine learning/statistics and those with a more biological background. Beginning with a contextualisation of cancer genomics as the starting point for drug and biomarker discovery, we will attempt to convince the reader that statistical theory serves as the backbone and language of modern developments in machine learning. To assist readers with less experience in biology, we will provide a very brief introduction to the types of data encountered in sequencing-based studies and the opportunities and problems they present. After providing some terminology and useful concepts from high-dimensional statistics, we will discuss how these concepts arise naturally in the context of cancer genomics, with some illustrative examples of how different techniques may be employed in translational research. We will conclude by sketching some modern developments and describing the transition from what can loosely be termed statistical learning to what is nowadays referred to as machine learning.
1.1 Cancer genomics in drug discovery
Since the success of the Human Genome Project, sequencing technologies have improved at an exponential rate, both in terms of cost per megabase sequenced and the number of individuals who have had some portion of their genome sequenced (although the cost remains higher in practice than often reported). This has introduced an invaluable new resource for biomedical research in general. For the study of cancer, a disease of the genome, the ability to rapidly and cheaply sequence normal and tumour-derived DNA has transformed basic research, giving rise to the field of cancer genomics. This is beginning to impact frontline clinical oncology. Whole genome sequencing is not yet standard of care for the typical cancer patient, but access to in-depth genetic data is becoming more common. Initiatives such as the 10,000/100,000 Genomes Projects and The Cancer Genome Atlas have given researchers access to large clinical datasets with a variety of accompanying omics data.
Understanding the genomic landscape of cancer is critical to the drug discovery pipeline, particularly in the pre-clinical identification of targets and biomarkers. Knowledge of the location and associated products of oncogenes (genes in which mutation can cause a cell to become cancerous) can allow for intelligent selection of druggable sites, while identification of tumour suppressor genes (genes that under normal circumstances prevent uncontrolled cell division) gives options for therapies which may replace patients’ defective cell cycle control mechanisms. Alongside new drugs, it is becoming increasingly common for therapies to be offered together with genomic biomarkers, which may identify the patients who are more likely to benefit from the treatment [7, 8].
These new sources and types of data give researchers a greatly expanded toolbox with which to investigate the causes and development of cancer, but also present a unique set of challenges. The number of covariates in omics datasets causes a variety of theoretical and practical problems for classical statistical analysis, a phenomenon often referred to as the curse of dimensionality.
1.2 Statistical learning and machine learning
Informally, the field of high-dimensional statistics attempts to address theoretical and computational problems associated with datasets in which the number of covariates (in our case this may refer to chromosomal locations or genes) is comparable to or greater than the number of samples available. This is often the case in cancer genomics. In these settings, results such as the central limit theorem, which rely on the sample size growing to infinity while the dimensionality stays fixed, are often not of much use.
Recent decades have seen much excitement around the application of machine learning methods to a wide variety of high-dimensional problems. Particular progress has been made in automated image recognition and natural language processing (NLP). This progress has come via the development of specialised techniques to exploit the
It is hoped that similar strides forward can be anticipated in biology, but it is important to acknowledge the current gap in data availability between cancer genomics and the other machine learning disciplines mentioned above. In the next section we will discuss typical types of biological data encountered in cancer genomics (including sequencing-based omics technologies that may not strictly be genomics, such as gene expression profiling), their dimensionality and typical availability. While efforts to deploy machine learning architectures are certainly producing results in some cases [13, 14], an important takeaway is that in many cases we are not yet in a situation where the data-heavy deep learning approaches that have revolutionised image recognition will be applicable to cancer genomics problems.
That is not to say that we cannot do anything! In fact, it is often instructive to try to make headway in situations where a ‘data-heavy, structure-light’ approach is unsuitable, and these sorts of investigations can have a profound impact on the design of more sophisticated models. As a final point, readers approaching without significant machine learning experience will find that an understanding of statistical terminology aids comprehension of the machine learning literature, which is built upon it.
2. Omics and biological data
2.1 DNA sequencing
Cancer genomics is underpinned by the ability to sequence DNA cheaply and quickly. DNA is organised into chromosomes, along each of which many genes are arranged, with further non-coding regions interspersed in between. The fundamental units of DNA are nucleotide bases, of which there are four varieties (labelled C, G, T and A). These are organised in groups of three called codons, each of which codes for the production of an amino acid. Codons are arranged in sequences such that their amino acids, when joined in a chain, form proteins: the products of genes.
The aim of sequencing is to read, base by base, the information content of DNA. This was originally done by Sanger sequencing, a procedure to infer the base composition of a piece of DNA one base at a time. High-throughput sequencing automates this process via the following workflow:
DNA is isolated from a sample and amplified (replicated many times) to ensure good signal.
Purified DNA is broken into many pieces of manageable length.
These short strands are sequenced individually and simultaneously by an automated process similar to Sanger sequencing.
These short sequences are matched to a reference human genome to identify where the DNA in the original sample differed from that reference.
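The final two steps of the workflow above can be sketched in a few lines. This is a toy, brute-force illustration only: the function names (`align_read`, `call_variants`), the 16-base reference string and the one-mismatch tolerance are all hypothetical, and real pipelines rely on indexed aligners and statistical variant callers rather than an exhaustive scan.

```python
def align_read(read, reference, max_mismatches=1):
    """Slide the read along the reference and keep the placement with fewest mismatches.

    Returns (start, mismatches), where mismatches is a list of
    (position, reference_base, read_base), or None if no placement is good enough.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [(start + i, ref_base, read_base)
                      for i, (ref_base, read_base) in enumerate(zip(window, read))
                      if ref_base != read_base]
        if len(mismatches) > max_mismatches:
            continue
        if best is None or len(mismatches) < len(best[1]):
            best = (start, mismatches)
    return best


def call_variants(reads, reference, max_mismatches=1):
    """Collect every position at which an aligned read disagrees with the reference."""
    variants = set()
    for read in reads:
        hit = align_read(read, reference, max_mismatches)
        if hit is not None:
            variants.update(hit[1])
    return sorted(variants)


# Hypothetical reference and three short reads taken from a sample that
# carries a single A -> T substitution at position 8 (0-based).
reference = "ACGTTGCAAGCTTAGC"
reads = ["ACGTTG", "TGCATG", "GCTTAG"]
print(call_variants(reads, reference))
```

The output of `call_variants` is exactly the kind of "list of differences from the reference" that the downstream analyses in this chapter take as their starting point.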
2.1.1 Tumour/normal variants
In cancer, some subset of cells accumulates mutations, via random misreplication of DNA during cell division or exposure to some external mutagen (e.g. cigarette smoke, UV light). Tumour cells therefore contain DNA with a different sequence to the patient’s typical sequence. To understand these differences, two samples are collected, one from the tumour and one from normal tissue, and both are sequenced. Comparing the sequences produces a list of locations at which mutations have occurred: these mutations can be of a variety of types (replacements, insertions, etc.) and can have vastly differing functional implications.
In the simplest setting, we could express a tumour’s mutational profile as a vector, with each component corresponding to whether the tumour-derived and normal sequences match at that point. How long would this vector be? The human genome contains approximately $3 \times 10^9$ base locations. This is the dimensionality (which we will refer to later on as $p$) of naively presented genomic data. We often like to compare the dimensionality of a dataset with the number of samples (which we will later call $n$) to which we can expect to have access. In this case, unless we have access to tumour profiling for more than a third of all humans on the planet, we can never hope that these numbers will be comparable. We could make a small gain by listing all codons in the genome, labelling a component as one if the codon has been functionally altered by mutations and zero otherwise. Here though we would still have $p \gg n$.
We could simplify our data further. Decades of biological research have focused on cataloguing the locations of genes across the genome. We might consider as covariates each of the (approximately $2 \times 10^4$) genes, and represent each sample as a vector where each component refers to (a) whether or not the gene contained a functional mutation; (b) how many such mutations were present; or (c) some other representation of the severity of the collective mutations present in the gene, drawing upon known biology. It is important to appreciate the trade-off we have made here: we have imposed an external notion of structure onto our data and in return have greatly reduced the dimensionality (by five orders of magnitude), but in exchange have lost resolution and thus potential information. This gain/sacrifice will be reflected when we choose to make even further structural assumptions in order to construct sensible models.
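The trade-off between base-level and gene-level representations can be made concrete with a small sketch. Everything here is illustrative: the 20-base sequences, the gene names `GENE_A` and `GENE_B` and their coordinates are invented, and for simplicity every mismatch is counted as a mutation rather than only functional ones.

```python
def mutation_vector(tumour, normal):
    """Base-level profile: 1 where tumour and normal sequences differ, 0 otherwise."""
    return [int(t != n) for t, n in zip(tumour, normal)]


def gene_level_counts(mutations, genes):
    """Representation (b): number of mutated positions inside each gene.

    `genes` maps a gene name to a half-open (start, end) interval of positions.
    """
    return {name: sum(mutations[start:end]) for name, (start, end) in genes.items()}


def gene_level_indicators(mutations, genes):
    """Representation (a): 1 if the gene contains any mutation, 0 otherwise."""
    return {name: int(count > 0)
            for name, count in gene_level_counts(mutations, genes).items()}


# Hypothetical matched normal/tumour sequences; positions 16-19 are non-coding.
normal = "ACGTACGTACGTACGTACGT"
tumour = "ACGAACGTACGTTCGTACGA"
genes = {"GENE_A": (0, 8), "GENE_B": (8, 16)}

profile = mutation_vector(tumour, normal)
print(len(profile), sum(profile))          # dimension and number of mutated bases
print(gene_level_counts(profile, genes))   # dimension drops from 20 to 2
```

Note that the mutation in the non-coding region disappears from the gene-level representation, which is the loss of resolution described above.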
2.1.2 Heterogeneity and depth
Another important concern for those dealing with cancer genome data is that tumours are often highly heterogeneous. Different sub-populations of cells have different mutation profiles, which fit into an evolutionary hierarchy within the tumour’s history. The importance of understanding the role of heterogeneity is beginning to be appreciated in a clinical context, and this has implications for the type of data that are used. In the context of the high-throughput sequencing pipeline, the relevant quantity is depth: identifying not just one but a variety of tumour sequences at a genomic locus along with the proportion in which they occur means thinking very hard about how best to express that data.
2.2 Gene expression
It is often not just the sequence of a gene which is relevant in a tumour, but the level of gene expression. This is most often estimated via the proxy of RNA transcript abundance: RNA is a similar molecule to DNA that is produced during the process of DNA being ‘read’, and acts as a messenger for sequences that should be converted to protein. Abundances of different RNA transcripts can be measured using procedures based on DNA sequencing. This will in general give data with the same dimensionality as gene-based mutation data, but of a different type. Measured values are continuous, representing concentrations of gene products, rather than discrete ‘mutated/not mutated’ values. This has implications for the sort of structural assumptions we can make about the data that we observe, and the models that will be best suited to capitalise on that structure.
3. Dimensionality and structure in statistical learning theory
Now familiar with the most relevant biological concepts, we turn to the mathematical theory of high-dimensional statistics, which has experienced a surge of interest in the last two decades. This is the language with which we will attempt to interrogate issues of inference and prediction in cancer genomics. Informally, we may think of high-dimensional statistics as being concerned with the regime in which the dimensionality of our input data, $p$, is comparable to or greater than the number of training samples $n$ we have available. In this regime the classical asymptotic theory of statistics, which generally relies on an assumption of fixed dimension and considers limiting behaviour as $n \to \infty$, may fail to apply. Classical results such as the law of large numbers and the central limit theorem then no longer give useful guarantees.
3.1 What is high-dimensional statistics?
We often consider a very generic setup, in which we have paired data $(x_1, y_1), \dots, (x_n, y_n)$. We model each of these pairs as being drawn from a joint probability distribution $P$, which gives the probability of observing any combination of observation $x$ and label $y$. For now we make no assumptions about the nature of the labels: they may be continuous values (regression), discrete values (classification) or more complicated objects, as is the case in survival analysis. We assume that $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$ for each $i$, so that our observed values are vectors of length $p$ and each element is a real number (possibly restricted to some subset such as the positive reals; this is what $\mathcal{X}$ specifies). We refer to $p$ as the dimension and $n$ as the sample size of our data. We wish to fit some model to the data. This could be in order to make some inference about the parameters of the distribution $P$, which will hopefully shed light on the effect of each of the covariates contained in an observation $x$. Alternatively, we might be trying to predict future values of $y$ from unlabelled observations $x$ as accurately as possible. These two aims are often distinguished by the umbrella terms statistical inference and statistical learning.
In many statistical models we have a vector of parameters $\beta$ with at least the same dimension as our data ($\beta \in \mathbb{R}^p$ in the simplest case). In generalised linear models (GLMs) the likelihood of an observation depends upon the data solely via the inner product $\beta^\top x$, so that each component of $\beta$ corresponds to the relative importance of its associated covariate. Classically, we would attempt to estimate the parameter $\beta$ from our observations through a procedure such as likelihood maximisation. However, it is clear in this context that if $p$ is comparable to or larger than $n$ then we have very little chance of accurately inferring the parameter vector $\beta$. For example, we cannot expect to simultaneously learn about the effect of 20 covariates if we only have 10 observations: we say here that the model is unidentifiable.
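A minimal numerical sketch of unidentifiability, using an invented design with two observations and three covariates: several quite different parameter vectors reproduce the responses exactly, so the data alone cannot tell them apart. The matrix `X`, the responses `y` and the candidate vectors are hypothetical values chosen for illustration.

```python
def predict(X, beta):
    """Linear predictions: inner product of beta with each row of X."""
    return [sum(x_j * b_j for x_j, b_j in zip(row, beta)) for row in X]


def squared_loss(X, y, beta):
    """Sum of squared differences between responses and linear predictions."""
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, predict(X, beta)))


# n = 2 observations, p = 3 covariates (the third is the sum of the first two).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 1.0]

candidates = {
    "only covariates 1 and 2": [1.0, 1.0, 0.0],
    "only covariate 3": [0.0, 0.0, 1.0],
    "all three equally": [0.5, 0.5, 0.0 + 0.5 * 0.0] if False else [1.0 / 3, 1.0 / 3, 2.0 / 3],
}
for label, beta in candidates.items():
    print(label, beta, squared_loss(X, y, beta))
```

The first two candidates fit the data perfectly while attributing the effect to entirely different covariates; without further structural assumptions there is no basis for preferring one over the other.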
High-dimensional statistics attempts to gauge what we can do in regimes such as these. One approach is to assume the data has some low-dimensional structure. This means that we can embed our data in a lower dimensional space such that the smaller representation contains all or most of the necessary information about the joint distribution $P$. We will discuss some common structural assumptions. The simplest and most interpretable is sparsity.
Given a vector $\beta \in \mathbb{R}^p$ parameterising a model, we say $\beta$ is $k$-sparse, for $k \leq p$, if at most $k$ elements of $\beta$ are non-zero, that is
$$|\{j : \beta_j \neq 0\}| \leq k.$$
We say a model parameterised by a vector $\beta$ is $k$-sparse if the vector $\beta$ is $k$-sparse.
Sparsity is a useful assumption to make for a variety of reasons. We are reducing the number of parameters that we must estimate: for a $k$-sparse model, we need only estimate $k$ parameters. Before we do so we need to decide which parameters are allowed to be non-zero, that is, to which $k$-dimensional coordinate subspace (out of $\binom{p}{k}$ choices) our parameter belongs. In practice this is not a huge issue, since some powerful theory from the field of convex optimisation allows for efficient training of sparse models (see the LASSO estimator below). Finally, sparse models are interpretable. A small number of covariates selected for importance can be useful in hypothesis refinement.
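To get a feel for how many candidate sets of non-zero parameters there are, the binomial coefficient can be evaluated directly. The helper name `n_supports` and the choice of 20,000 covariates (roughly the gene-level dimension discussed earlier) are illustrative assumptions.

```python
from math import comb


def n_supports(p, k):
    """Number of ways to choose which k of the p parameters may be non-zero."""
    return comb(p, k)


p = 20_000  # roughly the number of genes
for k in (1, 2, 5, 10):
    count = n_supports(p, k)
    print(f"k={k}: a number with {len(str(count))} digits")
```

Even for small $k$ the count grows far too quickly for every candidate set to be checked in turn, which is why the convex methods mentioned above matter in practice.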
3.1.1 Sparse data vs. sparse models
It is worth at this point drawing a distinction between two phenomena in statistics and data science both referred to as ‘sparsity’, both of which are exhibited in cancer genomics. The first is sparse
This notion that there is some sparse representation of data, but that it may not translate directly to a subset of our covariates, motivates the more general principle of Sufficient Dimension Reduction (SDR). Sparsity restricts our attention to some small subspace of the covariate space $\mathbb{R}^p$. More generally, we may insist on some important smaller subspace, but one that does not depend on a specific representation of our data $x$. The definition of SDR is somewhat more technical, so those without a mathematical background may find it easier to skip.
Given $(X, Y)$ drawn from probability distribution $P$, we say there exists a sufficient dimension reduction of size $d$ if there exists some function $\phi: \mathbb{R}^p \to \mathbb{R}^d$, with $d$ smaller than $p$, such that $Y$ is conditionally independent of $X$ given $\phi(X)$, that is,
$$Y \perp\!\!\!\perp X \mid \phi(X).$$
For an observation $x$, the image $\phi(x)$ is a $d$-dimensional representation of $x$. As a special case we have linear sufficient dimension reduction if the function $\phi$ is a linear projection $\phi(x) = Ax$ for some matrix $A \in \mathbb{R}^{d \times p}$.
Picking apart this definition, conditional independence means that $Y$ only depends on $X$ through some low-dimensional image. Note that, in contrast to sparsity, we have not made reference to a linear model parameter $\beta$. In fact, in the context of a generalised linear model where $Y$ depends on $X$ only through some function of $\beta^\top X$, we can simply take $\phi(x) = \beta^\top x$ and see that the model admits a sufficient dimension reduction with $d = 1$. SDR, therefore, is a helpful notion in settings in which we need to apply a non-linear model structure. Methods based on finding sufficient dimension reduction projections by searching through spaces of projections, in combination with non-linear base classifiers, are beginning to show promise in a variety of domains including the analysis of high-dimensional medical data.
3.1.2 Techniques in high-dimensional statistics: Selection and regularisation
It is all very well imposing assumptions of low-dimensional structure onto our data. How can we now exploit this to produce models that reflect the structural assumptions we have made? One answer is regularisation. Regularisation refers to some penalisation process being applied to the parameters of our model. The intuition is that, given a model parameter whose dimension is greater than or equal to the dimension of our data, and thus comparable to our number of samples, we have enough degrees of freedom when fitting the model that we can be guaranteed to produce almost perfect training set results without having done anything more than memorise our data. Therefore we must place restrictions on our parameter, and the trick is to do this as part of the model fitting process by adding a regularisation term to the loss function of our learning procedure (ideally in such a way as to preserve convexity of the loss, which allows efficient model fitting).
Regularisation is applied in practice across a whole range of model types, but is easiest to understand in the context of linear regression, so in the discussion that follows we will restrict ourselves to this setting.
In linear regression we have a model $f_\beta$, parameterised by $\beta \in \mathbb{R}^p$, given by
$$y = f_\beta(x) + \varepsilon = \beta^\top x + \varepsilon$$
for some noise $\varepsilon$. We are saying that $y$ can be approximated by a linear combination of the components of $x$, with the relative weightings of each component given by the components of $\beta$. The loss of our model (a measure of how inaccurately it is predicting across all our data) is given by
$$L(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^\top x_i \right)^2.$$
In general we choose $\beta$ to minimise this loss for an optimal model, but suppose we wish to find an optimal $k$-sparse model, that is, one for which $\beta$ is $k$-sparse. Rather than minimising over all possible choices of $\beta$, we are minimising the loss over all values of $\beta$ that are also $k$-sparse:
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p \,:\, \beta \text{ is } k\text{-sparse}} L(\beta).$$
Here we face a computational difficulty: we have to separately check each subset of covariates of size $k$ and minimise on that set of possible parameters, then compare them all to find the best. What we do to circumvent this is include a penalisation term for $\beta$, which encourages sparsity, alongside the loss function in our optimisation. An obvious choice would be the L0 ‘norm’, $\|\beta\|_0 = |\{j : \beta_j \neq 0\}|$, which counts non-zero coefficients. In practice this is not computationally feasible (to be technical, the resulting problem is non-convex and, in general, NP-hard), so instead we use the L1 norm given by $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$. While this does not explicitly encode sparsity, it turns out that in practice it does produce sparse solutions. This process of replacing a non-convex problem with an easier convex one is in general called convex relaxation. The resulting estimator is known as the LASSO:
$$\hat{\beta}_\lambda = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \left\{ L(\beta) + \lambda \|\beta\|_1 \right\},$$
where $\lambda$ is a positive number chosen to specify how strongly we want to encourage sparsity: different values of $\lambda$ will produce different estimates $\hat{\beta}_\lambda$ in the output. A particularly attractive feature of the LASSO estimator is that it acts simultaneously as a variable selection and model fitting procedure.
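As a rough sketch of how an L1-penalised fit can be computed, the following uses cyclic coordinate descent, one common approach and not necessarily the one used in any work cited here. It assumes the objective $\frac{1}{n}\sum_i (y_i - \beta^\top x_i)^2 + \lambda \|\beta\|_1$; the function names, the simulated data and the values of `lam` are all illustrative.

```python
import random


def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Approximately minimise (1/n) * sum_i (y_i - beta . x_i)^2 + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Inner product of column j with the residual that ignores covariate j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n))
            new_b = soft_threshold(rho, n * lam / 2) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Simulated (not real) data: n = 30 samples, p = 60 covariates, 3 truly active.
random.seed(0)
n, p = 30, 60
true_beta = [3.0, -2.0, 1.5] + [0.0] * (p - 3)
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(b * x for b, x in zip(true_beta, row)) + random.gauss(0, 0.1) for row in X]

for lam in (0.1, 0.5, 2.0):
    beta_hat = lasso_coordinate_descent(X, y, lam)
    support = [j for j, b in enumerate(beta_hat) if b != 0.0]
    print(f"lambda={lam}: {len(support)} non-zero coefficients")
```

Because the soft-thresholding step returns exact zeros, variable selection and fitting happen in the same loop; one would expect larger values of `lam` to leave fewer non-zero coefficients.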
To take stock, we began with an assumption that some small subset of our covariates is important in predicting the response $y$. This assumption might have come from necessity due to data availability, from knowledge of the biological system we are modelling, or from both. We will discuss these possibilities in more depth in Section 4. We have taken a simple model, altered it to express this structure, and done so in a way that is computationally feasible.
The specific form of the regularisation we employ can have very subtle effects on the traits it encourages in models, which should motivate us to be very careful when translating the biological knowledge we want to express into our learning systems. For example, adding an identical regularisation term but replacing the L1 norm with the squared L2 norm ($\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$) does not produce sparse models, but rather models that do not contain large coefficients. The corresponding structural assumption is slightly more technical (it amounts to placing a zero-mean multivariate Gaussian prior on the parameter $\beta$). This penalty can be applied in a wide variety of high-dimensional situations, often alongside other forms of regularisation, to combat over-fitting. The resulting estimator, known as ridge regression, is
$$\hat{\beta}^{\mathrm{ridge}}_\lambda = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \left\{ L(\beta) + \lambda \|\beta\|_2^2 \right\},$$
where again $\lambda$ is a positive value that can be selected by cross-validation to reduce overfitting.
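The qualitative difference between the two penalties is easiest to see for a single coefficient. The sketch below assumes the simplified one-dimensional objectives $(z - b)^2 + \lambda |b|$ and $(z - b)^2 + \lambda b^2$, where $z$ stands for the unpenalised estimate; this is a simplification of the general regression setting, and the function names and example values are illustrative.

```python
def lasso_1d(z, lam):
    """Minimiser of (z - b)^2 + lam * |b|: soft-thresholding at lam / 2."""
    if abs(z) <= lam / 2:
        return 0.0
    return z - lam / 2 if z > 0 else z + lam / 2


def ridge_1d(z, lam):
    """Minimiser of (z - b)^2 + lam * b^2: proportional shrinkage."""
    return z / (1 + lam)


unpenalised = [3.0, 0.4, -0.2, -2.0]
lam = 1.0
print("L1:", [lasso_1d(z, lam) for z in unpenalised])
print("L2:", [ridge_1d(z, lam) for z in unpenalised])
```

Under the L1 penalty the two small coefficients are set exactly to zero, while under the L2 penalty every coefficient is scaled down by the same factor and none vanishes.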
Figure 1 describes the workflow of modelling high-dimensional data. The data dimensionality, as discussed in Section 2, is the underlying problem, which we address with structural assumptions informed by a mixture of external knowledge and practicality; these are then transformed into a feasible computational problem. Intuition about both the biological and the statistical context is applied at each step.
For those unsatisfied with the abstract nature of the discussion above, we now attempt to provide more concrete examples.
4. Cancer genomics questions in the language of high-dimensional statistics
4.1 Biomarker/driver gene identification
We have discussed some of the terminology associated with high-dimensional statistics. We can now express some cancer genomics questions in the same language. We have data with a very high dimensionality $p$: bases, codons or genes (approximately $3 \times 10^9$, $10^9$ and $2 \times 10^4$ respectively) and we would like to predict some outcome, be it a survival value, biomarker signature or other phenotype. Due to the resources and time required to perform whole genome or exome sequencing we often face restrictions on the number of samples at our disposal. The popular Cancer Genome Atlas resource, for example, contains sequencing data for around 20,000 tumour/normal matched samples. Even if all of these samples were relevant to our study, and we were trying to predict some phenotype using gene-level data, we would be working in the $p \approx n$ regime. If we were using codon or nucleotide level information, we would be well into the $p \gg n$ regime. In the following we will assume we are working with some gene-level covariates, and investigate what sort of structural assumptions we may wish to make in order to fit tractable and robust models.
4.2 Sparsity by assumption: driver genes
Driver genes in the simplest sense are genes that, when mutated, will elevate the risk of the development, progression or adaptation of a tumour. They may be grouped roughly into oncogenes and tumour suppressors: oncogenes admit mutations giving some selective advantage to a cancer cell, while tumour suppressors in their standard form protect against aberrant cell growth or apoptosis evasion. Identifying driver genes (or driver sites within genes) among the extensive backdrop of mutation in tumours is notoriously difficult. Selection pressures produce subtle and often non-obvious patterns of mutation density between neutral and non-neutral genes, as well as distinct signatures for oncogenes and tumour suppressors. Neglecting these difficulties for now, suppose we wish to infer some phenotype $y$ (again for simplicity we assume that this is continuous and single-valued). We do not have nearly enough data to fully explore the dependence of $y$ on all genes simultaneously; we have to assume that there are
4.3 Sparsity by necessity: gene panels for genome-wide biomarkers
Another justification for selecting some small set of genes/genomic loci to include in an investigative panel is that the cost and time to perform sequencing depend (approximately linearly) on the size of the subsection of the genome to be sequenced, and the depth at which it is sequenced. This means that in many practical or clinical environments, cost is a major factor. While the cost of whole genome sequencing has decreased at an impressive rate, it is far from being standard of care for cancer patients. It is therefore important that gene-panel style biomarkers are as small as possible, while maintaining enough accuracy that clinicians feel confident in acting upon predictions. This is a particular issue for genome-wide biomarkers, which have gained popularity in recent years, for example in cancer immunotherapy. Examples include tumour mutation burden and indel burden, which report the density of somatic mutation across the entire cancer genome. In this case all regions of the genome are relevant to a greater or lesser extent (Figure 2), so the optimal panel for prediction would be the entire genome (or exome, depending on the specific biomarker). However, certain genes may be particularly relevant, for example by taking an active role in DNA repair mechanisms. When estimating such biomarkers, we therefore want to offset the positive predictive contributions of individual genes/loci against the added cost burden given by inclusion in the panel. Analyses of the impact of panel size on predictive power in theoretical and practical settings are becoming more common.
Suppose we have some set $G = \{g_1, \dots, g_p\}$ of genes, where $g_j$ refers to an individual gene with coding sequence of length $\ell_j$. Now let $S \subseteq G$ refer to a gene panel comprising a set of genes, and $f_S$ be a model trained on some data with covariates included according to the gene panel $S$. Then we might wish to solve the optimisation problem
$$\min_{S \subseteq G} L(f_S) \quad \text{subject to} \quad \ell(S) = \sum_{j \,:\, g_j \in S} \ell_j \leq \ell_{\max},$$
where $L(f_S)$ is the loss of the model $f_S$, $\ell(S)$ is the total length of the gene panel and $\ell_{\max}$ is some prescribed maximum panel length. Note the similarity with the LASSO setup described in Section 3.1. In the case of a linear model we can similarly reformulate the problem in terms of the parameter $\beta$, and solve the analogous problem
$$\hat{\beta}_\lambda = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \left\{ L(\beta) + \lambda \sum_{j=1}^{p} \ell_j |\beta_j| \right\},$$
where we have again swapped the panel length bound $\ell_{\max}$ for the regularisation parameter $\lambda$. Since all the values $\ell_j$ are positive, this is still a convex optimisation problem and thus can be solved efficiently as in the standard case. The value of $\lambda$ is less likely to be chosen via cross-validation, as smaller values of $\lambda$ (and hence larger panels) will generally improve predictive power. Instead $\lambda$ will be chosen to control the size of the resulting gene panel.
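A small sketch of how length weighting changes which genes survive. It assumes a deliberately simplified setting in which each gene's coefficient can be fitted separately, so that the objective for gene $j$ is $(z_j - b)^2 + \lambda \ell_j |b|$; here $z_j$ stands for the unpenalised coefficient. The function names, the four gene signals and the lengths (in kilobases) are invented for illustration.

```python
def length_weighted_lasso(z, lengths, lam):
    """Per-gene minimiser of (z_j - b)^2 + lam * length_j * |b|."""
    beta = []
    for z_j, len_j in zip(z, lengths):
        threshold = lam * len_j / 2  # longer genes face a larger threshold
        if abs(z_j) <= threshold:
            beta.append(0.0)
        else:
            beta.append(z_j - threshold if z_j > 0 else z_j + threshold)
    return beta


def panel(beta, lengths):
    """Indices of genes kept in the panel and the total panel length."""
    chosen = [j for j, b in enumerate(beta) if b != 0.0]
    return chosen, sum(lengths[j] for j in chosen)


signals = [2.0, 2.0, 0.5, 1.0]   # hypothetical unpenalised coefficients
lengths = [1.0, 10.0, 1.0, 3.0]  # hypothetical coding lengths in kb

for lam in (0.1, 0.5):
    genes_kept, total_length = panel(length_weighted_lasso(signals, lengths, lam), lengths)
    print(f"lambda={lam}: genes {genes_kept}, panel length {total_length} kb")
```

Genes 0 and 1 carry the same signal, but at the larger value of `lam` only the short one is retained, which is the cost/benefit offsetting described above.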
4.3.1 Distinguishing causative mutations
It should again be noted that these are illustrations of how high-dimensional model construction is done. In reality many more subtleties may have to be taken into account. In the above, a key caveat is the role of selective pressure in cancer-relevant genes, and how this affects the mutation rate in different sections of the genome. One way this can be investigated is by looking at the relative predictive power of synonymous and non-synonymous mutations for genome-wide mutation burden. The gold standard for identifying causative relationships between genotype and phenotype, however, remains functional validation.
4.4 Survival prediction
No review of statistical learning in cancer genomics would be complete without a mention of survival prediction. Survival prediction is useful in a variety of situations, far beyond direct prognostic application. Hazard regression models based on genomic data have been useful in identifying factors associated with therapeutic resistance or with general prognosis [29, 30], which are of great interest to those developing drugs or attempting to understand which patients can expect to benefit from them. Regularisation-based techniques adapt readily to proportional-hazards style models, on which there is a substantial literature beyond what we have scope to discuss in this chapter.
5. Modern techniques in high-dimensional statistics and dimensionality reduction
We conclude with some examples from recent literature of techniques related to dimensionality reduction in modelling genomic data. The examples have been chosen to demonstrate the structure/regularisation workflow discussed in this chapter, and are a small set of examples rather than (anywhere near) an exhaustive list.
5.1 Regularised graphical models
In the regression examples discussed previously, the parameters of interest have represented the weighted effect of observed covariates on a label. In both supervised and unsupervised settings, we are also often interested in how closely related different covariates are, through estimating the covariance matrix $\Sigma$ of the observation variable $X$. If we have an observation of dimensionality $p$, then the covariance matrix will be of size $p \times p$, so the problems of estimation from small $n$ are compounded even further!
Two forms of regularisation are popular, often used in tandem. The first is a sparsity penalty applied to all matrix entries $\Sigma_{ij}$. What does this correspond to structurally? It means that most pairs of covariates are independent (or at least uncorrelated). This is a very relevant notion in network analysis, where variables are thought to affect each other in a way that can be described by some graphical structure. Sparsity of matrix elements then corresponds to sparsity of the graph describing the network. It is also not uncommon to apply a sparsity penalty to the precision matrix, that is, to the components of the inverse covariance matrix $\Sigma^{-1}$, whose zero entries correspond, in Gaussian models, to conditional independence between pairs of covariates.
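One very simple way to impose a sparsity penalty on covariance entries is to soft-threshold the off-diagonal elements of the sample covariance matrix $S$, which is the exact minimiser of $\frac{1}{2}\sum_{j,k} (\Sigma_{jk} - S_{jk})^2 + \lambda \sum_{j \neq k} |\Sigma_{jk}|$. This is offered as an illustrative stand-in rather than as the method of any work cited here; the function names and the small expression-like dataset are hypothetical, and the result is not guaranteed to be positive definite.

```python
def sample_covariance(rows):
    """Unbiased sample covariance matrix of a list of equal-length observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in rows) / (n - 1)
             for k in range(p)]
            for j in range(p)]


def soft_threshold_offdiagonal(S, lam):
    """Shrink off-diagonal entries towards zero by lam; leave variances untouched."""
    p = len(S)
    out = [row[:] for row in S]
    for j in range(p):
        for k in range(p):
            if j != k:
                s = S[j][k]
                out[j][k] = 0.0 if abs(s) <= lam else (s - lam if s > 0 else s + lam)
    return out


# Hypothetical values for 3 covariates measured in 4 samples.
rows = [[1.0, 2.0, 0.5],
        [2.0, 4.1, 0.4],
        [3.0, 5.9, 0.6],
        [4.0, 8.0, 0.5]]
S = sample_covariance(rows)
sparse_S = soft_threshold_offdiagonal(S, lam=0.1)
for row in sparse_S:
    print([round(value, 3) for value in row])
```

Weak sample covariances are set exactly to zero, so the corresponding edges disappear from the implied network, while strong ones are retained in shrunken form.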
Alternately (or in addition), we may wish to limit the number of distinct
5.2 Localised sparsity assumptions
We have discussed sparse models extensively in this chapter. We might wonder whether there are any generalisations of the assumption that relatively few of our covariates are important throughout all of our samples. One such generalisation would be that for some subsets of our samples sparsity assumptions hold, but that the important covariates may differ from subset to subset within our data. In a localised sparsity setting, we are often given some knowledge of the organisational structure of data, either in a discrete way through a prior partition of the samples or network structure, or in a continuous way through a measure of distance between samples (which may come directly from the input data). We can then fit linear models that are regularised towards sparsity, but where variable selection is allowed to vary between samples, and allowed to vary more between samples that are more distant. This has been applied to the prediction of drug toxicity based on differential gene expression data.
5.3 Variational autoencoders
For our final example we consider a notion of dimensionality reduction that is more general and that has been studied extensively in the machine learning literature. This nicely elucidates the grey border between statistical and machine learning, and the difficulties and opportunities available to biological research by embracing the latter.
Variational autoencoders (VAEs) are a class of neural networks that come in a variety of architectures and sizes, but whose premise centres on an encoding/decoding framework between high-dimensional data and a lower-dimensional representation. VAEs have an ‘hourglass’ shape: input data is fed into the network, and information is propagated through layers of progressively smaller size until a bottleneck is reached. This central layer has a small number of latent nodes. Subsequent layers increase in size, ending in an output whose dimension matches that of the input. VAEs are trained to reproduce their training inputs as accurately as possible. We can then view the central latent nodes as an encoding of our input data. This encoding might (a) contain some insightful information and (b) be useful as lower-dimensional input data for training other models.
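To make the training objective a little more concrete, the sketch below writes out the two terms of a standard VAE loss for a single sample: a reconstruction error, and a penalty that keeps the latent encoding close to a standard normal distribution. It assumes the common formulation in which the encoder outputs a mean and a log-variance for each latent node; the function names, the squared-error reconstruction term and the weight `beta` are illustrative choices of ours, and the encoder and decoder networks themselves are omitted.

```python
import math
import random


def reconstruction_error(x, x_hat):
    """Squared error between an input vector and its decoded reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent node."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus a beta-weighted KL penalty on the latent code."""
    return reconstruction_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


# Hypothetical two-node latent encoding of one sample.
rng = random.Random(0)
z = sample_latent(mu=[0.5, -0.5], log_var=[0.0, 0.0], rng=rng)
```

In a full model, `z` would be passed through the widening half of the hourglass to produce `x_hat`, and the loss would be averaged over samples and minimised with respect to the weights of both halves.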
In the context of cancer genomics, VAEs pose two challenges, illustrative of those that machine learning procedures in general must overcome to be useful in a basic research or clinical setting. Firstly, they are highly parameterised compared to the types of model discussed so far. We have discussed at length the balance between data availability and model size, and the significant extra effort needed to extract information when data is scarce. One of the advantages of deep learning procedures is their versatility and their lack of dependence on prior knowledge and assumptions of structure. The cost is that they are very data intensive, prohibitively so in some cases. Secondly, while a VAE’s latent nodes may be informative within a network, there is no guarantee that they will be interpretable by a human, nor that biologically relevant features will have been neatly allocated to a single node. Strategies to ‘untangle’ VAEs are necessary to make biologically relevant predictions.
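A rough parameter count shows how quickly the first challenge bites. The sketch below counts the weights and biases of a plain fully connected hourglass network; the layer widths (an input of 20,000 features, roughly one per gene, with a 32-node bottleneck) and the cohort size are hypothetical values chosen purely for illustration.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network with these widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical hourglass: 20,000 inputs, a 32-node bottleneck, 20,000 outputs.
hourglass = [20_000, 512, 32, 512, 20_000]
n_params = dense_parameter_count(hourglass)

# Hypothetical cohort size, for comparison only.
n_samples = 1_000
params_per_sample = n_params / n_samples

print(n_params)  # 20533824
```

Even this modest architecture has over twenty million free parameters, tens of thousands for every sample in a cohort of this size, which is why some form of structural assumption or regularisation is hard to avoid.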
The dimensionality of data in genomics is a sticking point that, at its full potency, is more debilitating than in any other research discipline. Even at the current pace of growth in the availability of sequencing data, it will be a long time (if ever) before the most powerful and general machine learning techniques are at our disposal without recourse to the vast wealth of biological knowledge we as a species have accumulated. To use that knowledge properly, we need researchers who are able to speak the language of both camps. It is not sufficient that researchers in cancer genomics provide data and questions to researchers in machine learning, nor that machine learning researchers communicate back the output of their methods. Instead, methods need to be purpose-built by those who understand which features of cancer data are relevant, how those features manifest themselves and how to exploit them in a mathematically consistent way.
This entire workflow is quite easy to follow when the structure we insist upon in our models is very simple. Even when a structural assumption can be motivated in a single sentence (see Definition 3.1) and the model is simple (as in linear regression), a good design for the learning procedure might not be immediately obvious. It can, however, likely be given a fairly ground-up description within a single book chapter. When the structural assumptions we really want to incorporate extend as far as our current understanding of the mutational processes affecting tumours across heterogeneous cell populations, chromosomes, genes and codons, and the models we want to fit are similarly at the cutting edge of computational research, the position of an interdisciplinary researcher may well require far more legwork to maintain.
As motivation for the above legwork, it should go without saying that cancer genomics in the machine learning age has potential to do a great deal of good in the long term. Yet uncovering a deeper understanding of how cancer works is not the only worthwhile goal. Designing procedures that can work
Many thanks to Timothy Cannings and Belle Taylor for their support and advice, to John Cassidy for suggestions of improvements, to Steven Bradley for proofreading and providing a non-technical reader’s viewpoint, and to Morton for his invaluable contributions.
Conflict of interest
The author declares no conflict of interest.