For more than a decade, machine learning (ML) and deep learning (DL) techniques have been a mainstay in the toolset for the analysis of large amounts of weakly correlated or high-dimensional data. As new technologies for detecting and measuring biochemical markers from bodily fluid samples (e.g., microfluidics and labs-on-a-chip) revolutionise the industry of diagnostics and precision medicine, the heterogeneity and complexity of the acquired data present a growing challenge to their interpretation and usage. In this chapter, we attempt to review the state of ML and DL fields as applied to the analysis of liquid biopsy data and summarise the available corpus of techniques and methodologies.
- machine learning
- deep learning
- data analysis
- biomarker detection
- automated discovery
- literature review
Biological and medical sciences are becoming increasingly data-rich and information-intensive. This tendency, along with the growing availability of such data, enables a better understanding of important questions regarding the functions of organisms, the causes of diseases, etc. However, both the inherent complexity of biological systems and the high dimensionality and noisiness of the acquired data can make it remarkably difficult to correctly infer the underlying mechanisms. Machine learning (ML) and deep learning (DL) techniques are quickly becoming highly useful tools for solving difficult problems in biology and medicine, providing a mathematical apparatus for analysing vast amounts of information that would otherwise be difficult to process and interpret. Additionally, these fields themselves pose new challenges for machine learning that can ultimately advance existing ML techniques and give rise to new ones.
The mutual history of machine learning and the biological and medical disciplines is both long and complex. An early ML technique, the perceptron, was developed in an attempt to model the behaviour of biological neurons and was used early on to define the start sites of translation initiation sequences.
In this chapter, we first provide an overview of the commonly used ML and DL techniques and strategies and outline their broad areas of applicability with regard to processing and analysis of biological and medical data. Next, we attempt to summarise the available corpus of research and development concerning the application of ML and DL techniques to the process of analysis and interpretation of biomedical data, focusing on liquid biopsy analysis, outline several of the main avenues of such research, and predict the potential improvements and changes in this highly dynamic and quickly developing field. Expertise in ML is not a prerequisite for this chapter, although we assume basic overall familiarity with the most well-known ML and DL models, techniques, and methodologies.
2. Machine learning strategies
This overview is limited to classical software-based tools and techniques for brevity’s sake. Several hardware-based approaches are mentioned in the Future Prospects section.
The ML ecosystem is both extensive and complex [3, 4, 5], with many possible ways to subdivide or classify its members. One frequently used classification scheme outlines two broad groups of ML algorithms: supervised learning, where the model is presented with a set of example inputs labelled with the desired outputs (the training dataset) and the goal is to learn a mapping from inputs to outputs, and unsupervised learning, where no labels are given to the model, leaving it to discover structure in the data on its own. A notable related paradigm is reinforcement learning (RL), where the training signal consists only of positive (“reward”) and negative (“punishment”) feedback, given according to the model’s performance in the training environment.
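As a toy illustration of the supervised setting, the following sketch “trains” a nearest-centroid classifier on labelled two-dimensional points and then maps new inputs to outputs; the data and the classifier are purely illustrative, not drawn from any study discussed here.

```python
import numpy as np

# Minimal supervised-learning sketch: a nearest-centroid classifier.
# The training set pairs example inputs with desired output labels;
# "learning" here is simply storing the per-class mean of the inputs.
def fit_centroids(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict(classes, centroids, X):
    # Assign each point to the class of its nearest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

# Toy labelled data: two well-separated 2-D clusters.
X_train = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [4.8, 5.0]])
y_train = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X_train, y_train)
print(predict(classes, centroids, np.array([[0.1, 0.0], [5.1, 4.9]])))  # → [0 1]
```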
Another informative approach to classifying ML algorithms is based on the desired type of output of the given model, such as classification (division of the input data into two (binary classification) or more (multi-class classification) predetermined groups), clustering (similar to classification but with the groups not known beforehand), dimensionality reduction (simplification of high-dimensional input data by mapping them into a lower-dimensional space), search, etc. Of these, clustering is particularly notable due to its broad and general applicability and the wide range of models, methods, and algorithms [6, 7, 8] that can be employed to carry out cluster analysis.
The notion of “cluster” is often not precisely defined and tends to serve as an umbrella term for various types of data objects, typically groups of data points with small distances (appropriately defined) between group members, higher-density areas of some parameter space, particular statistical distributions, etc. The desired clustering algorithm, therefore, depends on both the given data set and the intended application of the returned results. Due to these complications, clustering, like many other data analysis methods, is typically not fully automated even within the domain of machine learning but instead tends to partially rely on preprocessing and initial parameter selection, based on the specifications of the task at hand.
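A minimal k-means sketch illustrates both the idea of distance-based clustering and the kind of task-specific choices mentioned above (the number of clusters k, the initialisation) that must be made before the algorithm can run; the data here are synthetic.

```python
import numpy as np

# Minimal k-means sketch. The number of clusters k and the initial
# centroids are exactly the kind of task-specific parameters that must
# be chosen based on the data and the intended application.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest centroid (Euclidean distance).
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two well-separated toy clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [8.9, 9.0]])
labels, _ = kmeans(X, k=2)
print(labels)
```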
Deep learning is a subfield of machine learning, distinguished by its learned, layered data representations rather than by any single algorithm. Like ML in general, deep learning can be both supervised and unsupervised. Deep learning models are loosely inspired by information-processing patterns in biological brains (and are therefore often called artificial neural networks): they use multiple layers of non-linear processing units (frequently called “neurons”, even though their similarity to biological neurons is usually limited) for pattern recognition and transformation, with each successive layer taking as input the output of the previous layer, forming a hierarchy of representations and levels of abstraction. The number of hidden layers of an artificial neural network broadly determines the “computational power” of the network (Figure 1).
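The layered structure described above can be sketched in a few lines: a toy feed-forward network in which each hidden layer applies a linear map followed by a non-linearity and passes its output to the next layer. The layer sizes and random weights are arbitrary placeholders.

```python
import numpy as np

# Toy feed-forward network: each layer is a linear map followed by a
# non-linearity, and each successive layer consumes the previous
# layer's output, forming a hierarchy of representations.
rng = np.random.default_rng(42)
layer_sizes = [4, 8, 8, 2]  # input, two hidden layers, output
weights = [rng.standard_normal((m, n)) * 0.5
           for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:  # non-linearity on the hidden layers
            x = np.tanh(x)
    return x

out = forward(np.ones(4))
print(out.shape)  # → (2,)
```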
Machine learning models have been applied to a wide variety of fields and problem classes, including computer vision, natural language processing, machine translation, bioinformatics, and biochemistry, with results often similar or superior to those produced by human domain experts.
3. Using machine learning techniques in blood tests
3.1. Classifying blood cells with deep convolutional neural networks
An important part of the data acquired by blood tests is the number of white blood cells (WBCs), or leukocytes, usually reported as total and differential WBC counts, where the latter describes the absolute and relative numbers of WBC subtypes (neutrophils, lymphocytes, basophils, eosinophils, and monocytes) in the sample. The number of WBCs in the sample provides information on the state of the patient’s innate and adaptive immune system: e.g., a significant change in the WBC count relative to the patient’s baseline is evidence that their body is being affected by an antigen, whereas variations in the specific WBC subtypes can correlate with specific types of antigens or different pathways of immune and inflammatory reaction. Therefore, detailed measurement and understanding of the WBC counts is an important part of the quantitative picture of health and the organism’s general condition.
Traditional methods of estimating the WBC count generally fall into one of two categories: manual and automated. Historically, manual inspection involved counting the number of cells in a blood sample under a microscope and extrapolating under the assumption of uniform cell distribution across the entire bloodstream. Automated methods involve specialised equipment, such as Coulter counters or laser flow cytometers, which can provide accurate results and good performance but are generally expensive and require specialised training to operate.
In this light, an ML-based approach offers a potential improvement over the aforementioned techniques for several reasons. First, it requires far less expensive equipment, being built around simple imaging solutions. Furthermore, unlike the earlier methods, it is able to provide almost instantaneous results after the initial training stage. Finally, its performance can be expected to improve over time in proportion to growing dataset sizes, and, being mostly software-based, it can be expanded and advanced continually and “over the air”, without requiring extensive changes to the underlying infrastructure.
We illustrate this approach using an example problem provided by
For the ML model,
3.2. Using deep neural networks for detection of ageing-related biomarkers
During the last decade, human ageing research has received an increasing amount of mainstream interdisciplinary attention [19, 20], with an emerging tendency to approach various aspects of the natural ageing process as potentially treatable conditions.
An ensemble of 21 feed-forward deep neural networks (DNNs) was created as the ML model, with a range of values assigned to DNN parameters such as the number of hidden layers, the number of processing units per layer, the activation function, and the optimisation and regularisation methods. The permutation feature importance method was used to evaluate the relative importance of the various biochemical markers with regard to ensemble accuracy. Batch normalisation was used to reduce the effects of overfitting and increase the stability of convergence of the models.
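The permutation feature importance idea can be sketched as follows: shuffle one input feature at a time and measure how much the model’s error grows. The “model” below is a fixed linear predictor standing in for a trained DNN ensemble, and the data are synthetic.

```python
import numpy as np

# Permutation feature importance sketch: the score drop after shuffling
# a feature estimates how much the model relies on it.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 2]            # feature 1 carries no signal
predict = lambda X: 3.0 * X[:, 0] + 0.1 * X[:, 2]  # stand-in "trained model"

def mse(a, b):
    return float(np.mean((a - b) ** 2))

baseline = mse(predict(X), y)                # 0.0 for this exact model
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])     # destroy feature j's information
    importance.append(mse(predict(Xp), y) - baseline)

print(importance)  # feature 0 dominates; feature 1 contributes nothing
```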
The best results were obtained from a DNN with five hidden layers, using a regularised mean squared error (MSE) loss function, the parametric rectified linear unit (PReLU) activation function in each layer, and the AdaGrad optimiser of the loss function. The highest-scoring DNN performed with 82% ε-prediction accuracy (i.e., considering a sample as correctly recognised if the predicted age is within ±10 years of the true age), outperforming several classes of competing ML models. Multiple models for combining the individual DNNs into an ensemble (stacking) were evaluated, with the elastic net model performing best. The most important blood markers were discovered to be albumin, glucose, alkaline phosphatase, urea, and erythrocyte count.
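The ε-prediction accuracy metric described above is straightforward to compute; the ages in the example below are made up for illustration.

```python
import numpy as np

# ε-prediction accuracy: a prediction counts as correct if it falls
# within ±ε years of the true age (here ε = 10, as in the text).
def epsilon_accuracy(y_true, y_pred, eps=10.0):
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)) <= eps))

true_ages = [34, 52, 61, 47]
predicted = [41, 49, 75, 50]   # absolute errors: 7, 3, 14, 3
print(epsilon_accuracy(true_ages, predicted))  # → 0.75
```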
3.3. Machine learning-based approach to Alzheimer disease biomarker discovery
In their study,
A sample set was used to construct the training dataset, consisting of normalised miRNA expression data across 465 loci. Cross-validation was used in feature selection to evaluate the impact of values from specific loci as features. The samples were randomly divided into 7 partitions of 5 positive and 5 negative samples each, and cross-validation was performed on these partitions, using 6 partitions for training and 1 for evaluation. The random partitioning was repeated 10 times in order to acquire 70 estimates of the performance measures of interest, one for each sample in the set. These values were averaged, and relative performance was assessed using the area under the receiver operating characteristic (ROC) curve, the Matthews correlation coefficient (MCC), and the F1 score.
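The two scalar metrics named above can be computed directly from the confusion counts of a binary classifier; the counts in the example are illustrative.

```python
import math

# Matthews correlation coefficient and F1 score from the confusion
# counts of a binary classifier: true/false positives and negatives.
def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Example: 30 true positives, 5 false positives, 10 false negatives,
# 55 true negatives.
print(round(f1(30, 5, 10), 3))        # → 0.8
print(round(mcc(30, 5, 10, 55), 3))
```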
The best model used by
3.4. Detection and classification of circulating tumour cells using machine learning methods
The presence of circulating tumour cells (CTCs) in blood samples indicates the tumour response to chemotherapeutic drugs and contributes to the mechanism of subsequent growth of derived tumours in distant tissues (metastasis). Evaluation of CTCs can yield a diagnosis or help to follow the tumour response to chemotherapeutic drugs.
The CNN received normalised pixel images as input. These were passed to a layer of 6 convolutional filters, followed by a max-pooling layer in order to extract the local signal in every pixel region, defined by the max-pooling function y(i, j) = max over (p, q) in R(i, j) of x(p, q), where (i, j) are pixel coordinates, R(i, j) is the pooling region, x is the input map, and y is the output map. This layer was followed by another convolutional filter layer, consisting of 12 filters, and, subsequently, by another max-pooling layer. The last layer was fully connected to the output layer by way of a dot product between the weight and input vectors, passed to the sigmoid function, which maps values to the (0, 1) range. The filter parameters, network bias terms, and weight matrices were automatically adjusted by backpropagation with a fixed learning rate.
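A minimal implementation of the max-pooling step, assuming non-overlapping square pooling regions (the 2×2 region size below is illustrative):

```python
import numpy as np

# Max-pooling over non-overlapping square regions: each output pixel is
# the maximum of one region of the input map, as in the formula above.
def max_pool(x, size=2):
    h, w = x.shape
    out = x[:h - h % size, :w - w % size]          # trim to a multiple of `size`
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))                    # max within each region

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [1, 0, 7, 8]])
print(max_pool(x))  # → [[4 1]
                    #    [1 8]]
```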
4. Cancer detection and monitoring using neural network-based methods
4.1. Using artificial neural networks for lung cancer detection and diagnosis
4.2. Mutation prediction and early lung cancer detection in liquid biopsy using convolutional neural networks
The proliferation of cancer cells is driven by specific somatic mutations in the cancer genome . To fulfil the high expectations associated with liquid biopsy, such as comprehensive characteristics of the whole tumour in contrast to limited sampling in the traditional tissue biopsy, or dynamic assessment during treatment, the somatic mutations must be detected with high sensitivity and accuracy; limited coverage depth is not sufficient.
For the training dataset, whole genome sequencing (WGS) data from 4 non-small cell lung cancer (NSCLC) patients and 3 melanoma patients were used. To ensure adequate genetic context even for variants appearing at the end of a read, additional bases were added to both ends of the read. Additional bases were also added to ensure equal read length in cases where a read was shorter than 150 bp.
The model was trained using minibatch stochastic gradient descent (SGD) with a batch size of 256, an initial learning rate of 0.1, and momentum of 0.9, with batch normalisation and a rectified linear unit (ReLU) applied after each convolutional layer.
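The SGD-with-momentum update rule can be sketched on a toy one-dimensional quadratic loss; the learning rate, momentum, and step count below are illustrative rather than those of the study, and batch normalisation is omitted.

```python
# SGD with momentum on L(w) = (w - 3)^2, whose minimum is at w = 3.
# Update rule: v ← m·v − lr·∇L(w);  w ← w + v.
def sgd_momentum(grad, w, steps=400, lr=0.1, momentum=0.9):
    v = 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad(w)
        w = w + v
    return w

grad = lambda w: 2.0 * (w - 3.0)   # gradient of (w - 3)^2
w = sgd_momentum(grad, w=0.0)
print(round(w, 4))  # → 3.0
```

The momentum term accumulates past gradients, which damps oscillation across steep directions and accelerates progress along shallow ones.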
4.3. Machine learning and nanofluidics in pancreatic cancer diagnosis
Using the Exosome Track-Etched Magnetic Nanopore (ExoTENPO) nanofluidics chip developed as part of the study,
Training datasets of 15 mouse and 10 patient profiles, respectively, were created. Linear discriminant analysis (LDA) was used to identify combinations of mRNA profiles that discriminated between healthy and tumour-bearing samples. The prediction algorithm was generated by running LDA on the training set, which produced a vector used to calculate a weighted sum that maximally separates the control group from the tumour-bearing group. Two independent blinded test sets, mouse and patient, respectively, were used to evaluate the performance of the LDA classifier, and Fisher’s exact test was used to quantify its predictive value.
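A two-class LDA sketch on synthetic data: the discriminant vector w = Σ⁻¹(μ₁ − μ₀), computed from the pooled within-class covariance Σ, defines a weighted sum wᵀx that separates the two groups, in the spirit of the classifier described above (the data and dimensions are toy stand-ins for the mRNA profiles).

```python
import numpy as np

# Two-class LDA: project samples onto w = pooled_cov⁻¹ (μ₁ − μ₀) and
# separate the groups by their scores (weighted sums).
rng = np.random.default_rng(1)
healthy = rng.standard_normal((20, 3))                      # class 0
tumour = rng.standard_normal((20, 3)) + np.array([5.0, 5.0, 0.0])  # class 1

mu0, mu1 = healthy.mean(axis=0), tumour.mean(axis=0)
pooled = (np.cov(healthy.T) + np.cov(tumour.T)) / 2.0       # pooled covariance
w = np.linalg.solve(pooled, mu1 - mu0)                      # LDA direction

scores_h = healthy @ w   # weighted sum for each healthy sample
scores_t = tumour @ w    # weighted sum for each tumour-bearing sample
print(scores_h.max() < scores_t.min())  # groups separate on this toy data
```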
Although in their study
4.4. Machine learning-based RNA sequencing for multi-class cancer diagnostics
The initial dataset consisted of blood platelet samples from healthy donors () and both treated and untreated patients with six different tumour types (NSCLC, colorectal cancer, glioblastoma, pancreatic cancer, hepatobiliary cancer, and breast cancer) in various stages of advancement and metastasis (). After the mRNA extraction, amplification, and sequencing, a set of approximately 5000 different mRNAs was selected for further analysis.
The accuracy of TEP-based multi-class cancer classification on the training dataset was estimated using an SVM algorithm. To cross-validate the SVM over the entire sample set, the leave-one-out cross-validation (LOOCV) method was applied. The percentage of correct predictions was reported as the accuracy score. The algorithm was run 175 times in order to classify and cross-validate the entire dataset. To determine specific input gene lists for the algorithm,
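The LOOCV procedure can be sketched as follows: each sample is held out once, the model is fitted on the rest, and accuracy is the fraction of held-out samples predicted correctly. A nearest-centroid rule stands in for the SVM here, and the six samples are synthetic.

```python
import numpy as np

# Leave-one-out cross-validation: one model fit per sample, each sample
# evaluated exactly once while held out of training.
def loocv_accuracy(X, y):
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        Xtr, ytr = X[mask], y[mask]
        # Fit: per-class centroids on the remaining samples.
        centroids = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
        # Predict the held-out sample by its nearest centroid.
        pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        correct += pred == y[i]
    return correct / len(X)

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(loocv_accuracy(X, y))  # → 1.0
```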
5. Using machine learning to accelerate DNA sequencing and biomarker development
5.1. A supervised machine learning-based approach to DNA sequence analysis
DNA sequencing and sequence analysis is an important task in many scientific and medical fields that is well-known for being both data-rich and computationally intensive.
A boosted decision tree regression-based predictor was used to estimate the execution time of DNA sequence analysis for both the host CPU and the Intel Xeon Phi co-processor. The predictor’s output was used to partition the DNA sequence based on the S-factor, a function of T_h and T_c, where T_h and T_c are the predicted execution times for the host CPU and the co-processor, respectively, using the partitioning scheme D = D_h ∪ D_c, where D is the original DNA sequence, D_h is the part of D analysed by the host CPU, and D_c is the part of D analysed by the co-processor.
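The partitioning idea can be sketched as splitting the input between the two devices in inverse proportion to their predicted per-unit execution times, so that both finish at roughly the same moment. The constant-rate “predictor” below is a stand-in for the boosted-tree regressor, and the split rule is an illustrative choice, not the study’s exact S-factor formula.

```python
# Split a sequence between two devices so the faster device (lower
# per-unit execution time) receives the proportionally larger share.
def partition(sequence, t_host, t_copr):
    share_host = t_copr / (t_host + t_copr)   # host share of the work
    cut = round(len(sequence) * share_host)
    return sequence[:cut], sequence[cut:]

seq = "ACGT" * 25                    # 100-base toy sequence
host_part, copr_part = partition(seq, t_host=2.0, t_copr=3.0)
print(len(host_part), len(copr_part))  # → 60 40
```

With per-unit times of 2.0 and 3.0, the host processes 60 bases and the co-processor 40, so both take roughly 120 time units.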
6. Future prospects
While the ML models currently used in liquid biopsy analysis in particular and biological and medical research in general (typically different classes of neural networks and linear classifiers) appear to both produce accurate results and show generally high performance, they represent only a narrow subset of machine learning and artificial intelligence solutions . For instance, a potentially valuable research direction might be in the form of highly advanced probabilistic graphical models  augmented with functionality such as one-shot learning  and probabilistic program synthesis , which could potentially allow researchers to reduce the size of the commonly massive training datasets required for creating ANN- or DL-based models.
Furthermore, with a single exception, all of the studies reviewed here have been focused on the performance and accuracy of software ML models, which is currently the predominant class of machine learning solutions. However, recent advances in general purpose computation using both graphics processing units (GPUs) and specialised application-specific integrated circuits (ASICs) tailor-made for machine learning  provide a strong case for the exploration and exploitation of hardware or hybrid ML solutions, as evidenced by, e.g., the results from the AlphaGo experiments and public performance .
Liquid biopsy-based approaches open many promising and so far little-explored opportunities for studying and measuring biological and biochemical markers, with broad applications for the monitoring, diagnosis, and prognosis of a large class of diseases and processes. Machine learning, with its advanced pattern recognition capabilities, will likely play an increasingly important role in these fields, as the amount and complexity of data produced by scientific and medical sources already far exceeds the capacity of unaided human experts and is rapidly increasing with no foreseeable slowdown.
In addition, machine learning tools form a natural synergy with distributed, highly parallel, or cloud-based computation solutions, readily supporting collaboration among researchers and medical professionals in distant locations and involving amounts of data storage and processing power previously available only on dedicated high-performance computing (HPC) platforms and supercomputers. It is likely that in the near future the importance of decentralised collaboration will continue to grow, increasing the demand for powerful and easy-to-use toolsets for the analysis and processing of biological data.
Based on these trends, we expect that the next generation of liquid biopsy technologies will include many types of machine learning as an integral part of their operation and that this trend could have a significant positive impact on both diagnosis and treatment of patients.
The present work was carried out within the frame of scientific project № 1.1.1.2/VIAA/1/16/242.
Conflict of interest
The authors declare that the chapter was written in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
- Eosinophils, basophils, and neutrophils are polymorphonuclear, while lymphocytes and monocytes are mononuclear.
- The naming scheme represents the number of “neurons” in the input, hidden, and output layers of the MLP model, respectively.
- A CNN architecture developed by the Visual Geometry Group at the University of Oxford.