Approaches for Handling Immunopathological and Clinical Data Using Deep Learning Methodology: Multiplex IHC/IF Data as a Paradigm

Recent advancements in deep learning based artificial intelligence have enabled us to analyse complex data in order to provide patients with improved cancer prognosis, which is an important goal in precision health medicine. In this chapter, we would be discussing how deep learning could be applied to clinical data and immunopathological images to accurately determine survival rate prediction for patients. Multiplex immunohistochemistry/immunofluorescence (mIHC/IF) is a relatively new technology for simultaneous detection of multiple specific proteins from a single tissue section. To adopt deep learning, we collected and pre-processed the clinical and mIHC/IF data from a group of patients into three branches of data. These data were subsequently used to train and validate a neural network. The specific process and our recommendations will be further discussed in this chapter. We believe that our work will help the community to better handle their data for AI implementation while improving its performance and accuracy.


Introduction
Improved cancer prognosis is a vital goal of precision health medicine. Advancements in Deep Learning (DL) based Artificial Intelligence (AI) technologies enable modelling of complex data providing deeper insights and patients with more reliable results. Machine Learning (ML) is the process of enabling machines to make predictions from data that is fed into it. This includes DL, a type of approach created from the development of artificial neural networks [1]. The DL network consists of multiple layers of artificial neural networks including an input, an output and multiple hidden layers [2,3]. Predictions are made after datasets are generated from and trained against these hidden layers. Recent advancements in computational processing power has sparked interest in tapping into the vast research on DL and applying it to digital pathology. Digital pathology is the process of digitizing Whole-Slide Images (WSI) using advanced slide-scanning techniques and AI-based methods for detecting, segmenting, diagnosing and analysing digitized images [4].
DL is the engine of advancement in artificial learning in both computer and clinical sciences. It is a collection of computer learning algorithm layers that uses the raw data input to first generate generalise features that are subsequently used to progressively extract higher level features such as tumour stroma count, and assign them class labels. Eventually, the system will distinguish different interest categories via the identified ideal data features. The DL approaches are widely accepted due to the ability of discovering patterns and signals from data too large for human comprehension. Furthermore, the multiple layers allow modelling of highly complex non-linear problems. On top of having higher accuracy, the DL approaches are also easily applied.

The importance of deep learning in digital pathology and mIHC
In current clinical practice, pathologists base their medical diagnosis on the quantification and visual recognition of the analysed sample details, which could lead to diagnostic discrepancies and potential suboptimal patient care [5]. The increased adoption of non-invasive clinical procedures to acquire diagnostic samples has also severely reduced the quantity and quality of samples obtained, which compounds the workload of pathologists. In view of the inter-variabilities in analysing samples manually and the limitations of available samples, the use of DL analysis has thus been researched on and progressively applied in the clinical practice.
DL in digital pathology aims to improve the workload of pathologists by automating time-consuming tasks, hence allowing additional time to be spent on disease presentations with complex features. AI applications in digital pathology can also be applied to develop prognostic assays that evaluate the severity of diseases and make an accurate prognosis in response to therapy. This could be applied to various image processing and classification tasks, such as low-level jobs revolving around image recognition issues including detection and segmentation, as well as high-level tasks such as prognosis of response to therapy based on patterns of images [6,7]. Such AI approaches are designed to extract relevant image renditions to train machines to be used as specific segmentation, diagnostic or prognostic tools.
One of the most extensively used DL models in pathology image analysis is the Convolutional Neural Networks (CNN). The CNN is a class of deep, feedforward networks, comprising several layers which extrapolate an output from an input and contains multiple convolutional sheets. These convolution sheets are the foundation of a CNN in which the network learns and extrapolates feature maps from images using filters between the input and output layers [4]. These layers in CNN are not connected as the neurons in one layer only interact with a specific region of the previous layer instead of all its neurons. The CNN also contains pooling layers which primarily function to scale down or reduce the dimensionalities of the features. CNN DL-based approaches are used for image-based detection and segmentation tasks to distinguish and quantify cells, histological features or highlight regions of interest [4]. CNN DL-based approaches have also been developed to automatically distinguish and segment blurry areas in digitised WSIs with high accuracy.
Another type of DL approach is the Fully Convolutional Network (FCN) which learns representations from every pixel and makes a potential feature detection that may occur sparsely in the entire pathology image [4]. FCNs uses co-registered Haemotoxylin and Eosin (H&E) images with multimodal microscopy techniques to classify WSIs into 4 classes: cancer, non-malignant epithelium, background and other tissues. When FCN was used to detect invasive breast cancer regions on WSIs, it had a diagnostic accuracy of 71% (Sørensen-Dice coefficient) when compared to an expert breast pathologist's assessment [8]. With better technologies and further research, FCNs can potentially automate these tasks with a higher accuracy, reducing the workload of pathologists. AI-based approaches such as Generative Adversarial Network (GAN)-based approaches can be used for training to automatically score tumoral programmed cell death 1 ligand 1 (PD-L1) expression in biopsy sample images [4]. They reduce the number of inputs required from pathologists and make up for lack of tissue samples available in biopsy specimens. Novel GAN-based approaches propose converting H&E staining of WSIs to virtual immunohistochemistry staining, thus eliminating the need for destructive IHC tissue testing.
Many have also trialled DL in the field of immunohistochemistry (IHC). Traditional IHC is a common diagnostic tool used in pathology, but its application is significantly limited by its ability of only allowing single marker labelling per tissue section [9]. Alternatively, multiplex immunohistochemistry/immunofluorescence (mIHC/IF) technologies permit simultaneous detection of several markers on a single tissue section [9]. However, analysing large samples with multiple markers in conventional and manual ways by pathologists are highly time-consuming, laborious and susceptible to human error. By combining mIHC/IF with DL to analyse digitized WSIs, this will overcome the limitations.
In conclusion, the research and diagnostic fields have come a long way since the introduction of IHC. With the introduction of Al-based approaches in the application of IHC, higher accuracy and productivity could be achieved not just in the diagnostic level but also providing us with a platform to further venture into areas of medical knowledge yet to be fully understood.

mIHC/IF Technologies
To-date, our understanding of cancer immunotherapy has evolved and led to multiple studies investigating and refining strategies targeting negative regulators. Many have studied the use of checkpoint blockade immunotherapy such as programmed cell death receptor 1 (PD-1), PD-L1 and cytotoxic T-lymphocyteassociated protein 4 (CTLA-4) in a variety of cancers. The subsequent success of checkpoint blockade inhibition in clinical trials has led to the Food and Drug Administration's approval of various drugs such as Ipillimumab and Prembrolizumab for melanoma treatment of non-small cell lung cancer (NSCLC) respectively [10]. Furthermore, trial of combination immunotherapy has shown clinical efficacy in various cancers [11,12]. However, other studies have also suggested that efficacy of these immunotherapy in various cancers may depend on the expression of biomarkers. For example, PD-L1 is suggested as a useful predictive marker in patients with NSCLC receiving Prembrolizumab [13]. However, this is not the case in patients with stage III melanoma [14]. To further discover potential biomarkers that could determine the efficacy of immunotherapy in various cancers, IHC has been introduced as a platform for these clinical studies.
Since its introduction in the 1940s [15], conventional IHC has been widely used in field of pathology and research. It involves the process of staining tissues samples using antibodies specific to antigens present within the samples. This specificity allows microscopic visualisation for diagnosis of neoplasm and obtaining valuable prognostic information. Despite this, it does have several limitations. The inability of labelling more than one marker per tissue sample has resulted in loss of potential information for analysis. For instance, the prediction of prognosis to an immunotherapy such as PD-L1/PD-1 checkpoint blockade may depend on the expression of an individual biomarker or in combination with other biomarkers [16][17][18]. Furthermore, the immune system can potentially be better understood, if the analysis of various biomarkers' expression patterns are done simultaneously, or cellular interactions within the tumour microenvironment can be visualised [19].
Moreover, IHC involves many critical steps which have high inter-user variability. For instance, antigens such as Ki-67 are more vulnerable to ischemia. As such, over fixation could result in irreversible damage to these antigens [20]. The concern of IHC's reproducibility such as for Ki-67 and its implications was also mentioned in the 2017 St. Gallen International Expert Consensus Conference [21]. However, multiple studies have since demonstrated that analytical variability can be negated with the use of digital analysis to calculate biomarkers index [22,23].
Although conventional IHC is a cost-effective diagnostic and prognostic tool, it has been replaced with the introduction of mIHC. mIHC has been used to overcome the shortcoming of single biomarker labelling in conventional IHC. The use of mIHC has proven to provide an even more accurate analysis as seen in the study by Yeong et al., where the simultaneous quantification of three different PD-L1 antibodies (22C3, SP142 and SP263) by mIHC scoring had moderate-to-strong correlation (with 67%-100% individual concordance rates and Spearman's rank correlation coefficient values up to 0.88 [24]) when compared with manual scoring from four different pathologists.. This demonstrated the use of mIHC as a promising tool for an even more accurate analysis.
The use of mIHC has played a significant role in both research and clinical studies of cancer immunotherapy. mIHC is a relatively new tool to study the spatial tumour microenvironment especially those of limited tissue specimens. It has great potential in clinical and translational application. This was demonstrated by Halse et al., who used mIHC to reveal a close relationship between the presence of CD8 + T cells within the tumour and the expression of PD-L1 in melanoma [25]. A systemic review and metaanalysis of studies also reported that mIHC improved results in predicting responses to PD-1/PD-L1 checkpoint blockade immunotherapy in various solid tumour types when compared to using conventional IHC analysis [26]. Several studies have also used various types of mIHC to obtain data for analysis. For instance, TSA-based mIHC was used to profile PD-1 to PD-L1 proximity in 166 metastatic melanoma samples and 42 Merkel cell carcinoma samples in two respective studies [27,28]. As aforementioned, understanding the tumour microenvironment could potentially provide a foundation upon which interpretation of immunotherapy response could be made.

Use of mIHC in combination with digital pathology
mIHC can be powered by digital pathology analysis software, such as inForm (Akoya Biosciences, California, USA) [29][30][31] and HALO TM (Indica Labs) [28,32]. These software resolve the restrictions of labeling a single marker per tissue section by precisely evaluating the unique localization of multiple simultaneously detected biomarkers and their co-expressions or interactions between cells [33]. For example, although Ki-67/PD-L1 labeling is useful by itself, a multiplex approach enables several markers to be interrogated simultaneously [34][35][36]. However, only analytical digital pathology solutions for Ki67 and PD-L1 scoring are currently commercially available as listed in Table 1 [33,37]. The involvement of digital pathology has also decreased intra-and inter-observer variability seen in manual scoring as previously highlighted. Consequently, using mIHC in conjunction with digital analysis software will resolve the restrictions of conventional IHC, thus providing us with an accurate and powerful tool in the interpretation of immune response in various fields.

Proposed deep learning framework for analysing immunopathological and clinical data
This section presents a holistic guiding framework to select and develop a DL architecture for multi-dimensional analysis. The entire pipeline can be broken down into 3 parts: [1] data pre-processing, [2] feature engineering and [3] model selection, validation, and evaluation (Figure 1). This includes treating the data input, selecting the appropriate model for the type of data and using the preferable method to validate the selected model.
To demonstrate the clinical application of the framework, a total of 107 clinical as well as mIHC/IF data from patients with breast cancer (BC) previously published [37]. The clinical data consists of parameters such as age and tumour grade as stated in Table 2 Row 1, while the mIHC data comprised of antibody-based spectral unmixing result obtained from stained mIHC image of tumour section labeling markers such as cytokeratin, CD68, CD8, CD20, FOXP3, PD-L1, and CK/EpCAM (Figure 2).

Data pre-processing
The first step in data pre-processing involves analysis of the dataset. This process consists of four main components: [1] one-hot encoding, [2] data normalization, [3] data enhancement and [4] data shape conformity.

One-hot encoding
One-hot encoding is the process of converting any non-numerical data existing in the clinical dataset to a categorical numerical representation that is readable by the computer. Any non-numerical data within each category is split into the number of categories it has and encoded with a binary 0/1. For example, in the case of our clinical data, the columns, "Lymphovascular Invasion" contains 3 possible values: positive, possible, negative ( Table 3). This represents 3 categories and is the prime candidate to be one-hot encoded. The input column is subsequently expended to 3 columns, one for each category in this input as shown in Tables 3 and 4.

Data normalisation
Most CNN research and models are developed with the intention for application in Computer Vision, where an entire input image data points are all pixels with ranges from zero to 255. Non-imaging datasets are more complicated as each input parameters have different units of measurements that might range from ones  to hundreds of thousands. Using such models with disparate values meant that a model with a large input parameter could easily outweigh another with a smaller value range. Therefore, data normalisation is needed to ensure that the dataset has comparable values across the data inputs while still maintaining their distribution within each data input. Data normalisation was done by scaling each input column to carry a mean of 0 and a standard deviation, by applying the following formula: An example of a segment of clinical dataset following one-hot encoding is as shown in Table 5. A notable feature of the clinical dataset is the disparate values across the columns which arose due to the different units of measurements used across the columns, such as categorical numbers, months, and millimetres. As such, these numbers could not be directly compared. To obtain a more comparable data, normalisation of these values was done, while maintaining the distribution within each column ( Table 6).

Data enhancements
When working with a medical dataset, it will be advantageous to have medical insights augment the data, as it can improve the result. The use of medical insights is however dependent on the context of the problem and is subjective to the augmentation or removal of features and/or any dataset. In this study, clinically relevant data was augmented to the cell dataset to count the number of stroma and cancer cells of each patient. Subsequently this was evaluated with a simple 12-layer dense neural network and the obtained results were compared with and without data enhancements on 10000 epochs. It was discovered that there was a marked improvement in the reduction of mean absolute error by 14.8% when the clinical dataset was enhanced with more relevant information. However, the reduction in mean absolute error was highly dependent on clinical dataset used and thus varies with its application.

Ensuring data conformity
The CNN requires the dataset to be homogeneous in its shape, which is achievable in the classical Computer Vision problems where images could be resized to a uniform rectangular shape. However, in the case of medical dataset, the dimensions are mostly dependent on the source of the data, which is usually 3-dimensional or more. There are two ways to homogenise medical dataset, either by appending nonmeaningful data to the clinical dataset, or selectively removed data until the shape is uniform. This process requires a higher dimensional visualisation which is best explained using a tangible example as follows:    In this study, the cell dataset has the attributes of a 3D feature shape, where each patient has a list of cell inputs. This gives it a general 3D shape consisting of (Patients x No of Cells x Cell Parameters x 1). However, as each patient has a different number of cells, this results in a non-uniform dataset shape of (104 patients x χ cells x 144 Cell Parameters x 1), whereχ denotes the patient dependent number of cells in the dataset which ranges from 991 cell inputs to 10720 cell inputs.
Two strategies were evaluated to assess for the best approach of ensuring data conformity. The first method involves keeping the model and parameters constant. A dataset (Complete Cell Data) of "ghost cells" with value 0 were appended to the data to ensure regular shape of (107 x 10720 x 144). Alternatively, a dataset (Random Sampled Cell Data) with cells that were sampled randomly are pegged to the patient with the least number of markers to create a shape of (107 x 991 x 10720) ( Table 7). Random sampled cell data are also employed for the training and evaluation of the DL model in this study. Following evaluation, no difference in the accuracy between both sets of data were observed. However, it was found that the smaller dataset required a smaller DL network that requires less computational power.

Feature engineering
In retrospect, there is no definitive answer when selecting a DL model. However, there is a wide array of models including the basic dense layers or CNN that could be applied. In view of multi-dimensional datasets, CNN is the most versatile in its ability to accommodate multi-dimensionality and has a strong community of research & development from Academics to Corporations. Despite so, CNN may be unsuitable for a non-imaging problem, as most CNN research is based on imaging problems where many of its tools such as max pooling may only work for spatial data. However, if harnessed correctly, CNN offers a highly flexible and advanced architecture that works for many types of data.
To understand the limitations of CNN on non-imaging dataset, it is essential to understand the fundamental difference between a spatial and non-spatial data. In spatial data like an image, a data point in one position is highly related to its surrounding pixels. Whereas for purely numerical data, one data point may not be related to its surrounding data points. It could instead be related to another data that is located at a different position, or more succinctly, is position independent.

Model selection and parameters
Most CNN tools assume that data points are position dependent. In this study, we looked at the dataset at hand to select a suitable CNN model and to adapt a powerful CNN tool called Pooling Layer to the non-spatial data. To select a suitable model that best fits the dataset and problem at hand, one should consider the general dimensions of the dataset which dictates the type of CNN to be used as listed in Table 8. For models that involve interaction with the environment, agent-based  models may be used. Furthermore, one should consider if the problem is a prediction or classification problem and if additional correlational features are necessary.
As for a complex problem, the use of a deeper model may be more appropriate. Nevertheless, using a deeper model could lead to additional problems that require new architecture to overcome. Lastly, the objective of the problem and if it is a scalar prediction or classification problem must also be deduced ( Table 9).
In this study, 2D CNN is used as the dataset is 3D and image-like. The problem type is a prediction model thus a plain 2D CNN with no agent-based is used. Given that the problem is complex, a Wide Residual Neural Network (WRN) identity mapping may help with modelling its complexity. Lastly, as this is a scalar prediction problem, the model should end with 1 sigmoid function and mean absolute error (MAE) objective function ( Table 9).
The model selection process must be carefully chosen as it dictates the basis of the model and its result. To illustrate, an early stage proof-of-concept of application of DL on this dataset, a categorical approach was taken, where the patients were split into categories based on their survival rate in years. In designing it this way, the aim was to apply categorisation as the objective function [38]. However, this approach introduced an unintended consequence, a fixed error of the range of each category that could not be rid of regardless of how accurate the model is. This was because of framing a scalar problem as a categorical problem. Even though the resulting model achieved an accuracy of 90% [38], it did not show the in-built error of the prediction that was hidden by the range of each category.
Pooling layers progressively achieve spatial invariance by reducing the resolution of the feature maps, which reduces the number of parameters and computation in the network. This presents one with the ability to create a much deeper network with limited computational cost and overfitting. In a pooling layer, a simple function could be applied. The two conventional functions available are [1] maximization function, which find the maximum value of the region as a representation, and [2] average pooling function that aims to find an average representation of the region, where p is the resultant value of the pooling operation (Figure 3).    However, conventional form of a rectangular pooling layer is not applicable in datasets with only vertical relations. Pooling is mainly done in the context where all data in a 2D array are spatial, where any integers within the array size are related spatially. Such techniques help to compact representations, which could greatly influence the model's performance. With regards to this study, a 2D non-spatial cell dataset, each row has a different unit such as size (mm) or standard deviations. Pooling together variables of different types would result in an invalid representation. Thus, a different form of a pooling layer for non-spatial data could be created instead. Such rectangular pooling seeks to pool between data of the same type to create a representative value of the region, while reducing data noise, and the parameter size of the network.
Furthermore, the operation of the study aims to take a sample of group of nine cell markers of the BC dataset and to obtain the maximum value of each set. A graphical representation of the operation of max rectangular pooling layer (RPL) is shown in Figure 4.  In another experiment comparing a plain vanilla 2D CNN with RPL of 9 x 1 dimension and a conventional square pooling layer (SPL) of 3 x 3 dimension, it showed both having the same max function ( Table 10). Comparing the training record of both pooling shapes, the RPL generalised at a much faster rate, of about 500 epochs ahead of SPL to achieve the same MAE. The former also achieved a lower MAE at the end of the training. In the context of a large dataset and DL network, using RPL in a non-spatial 2D dataset could achieve significant reduction in computational time.

Validation
In the context of medical dataset, one common hampering factor is having a small dataset. This results in a validation process that is not robust enough as there may be an uneven distribution of data across the dataset. Traditional holdout validation is not rigorous enough to negate this effect and may result in an unfair representation of the efficacy of the model. This could be overcome with the use of K-Fold cross validation (K-cv), which is done by splitting the dataset by k iteratively holding out the sections of the data and evaluating the model with an average prediction error of all k evaluations (Figure 5).   K-cv provides a more robust way of validating a model by validating a model with the entire data set. A study by Rodriguez et al. confirmed that K-cv reduces variance in prediction error and recommends implementing a K-cv whenever computationally possible [39]. In this study, the BC dataset was split into four groups of 23 patients and the standard deviation σ of each group is evaluated. It was discovered that the σ across all four groups was 8.43 months, which is a substantial amount in its ability to misrepresent the efficacy of the dataset.

Evaluation
The evaluation step acts as a feedback loop to the development of the CNN model. An iterative approach must be taken to analyse these results from a DL and a medical point of view to understand how further improvements could be made to the CNN model. Firstly, a model was built to evaluate clinical dataset followed by another model to evaluate the cell dataset. In an experiment with 107 patients, an adaptation of Dense ResNet [40] to the clinical data was used. A 2D CNN Wide Residual Network (WRN) [41] was also adapted for the cell data (Figure 6). A benchmark was developed on the dataset as a starting point for comparison, as this was a greenfield application. A simple vanilla dense network was used as a starting point to benchmark the results for the clinical dataset containing patients' information such as age, ethnicity, and tumour size. For the immunopathological dataset, we use a benchmark CNN model from the imaging domain as our starting point. MobileNet50 V2 [42] was chosen for the starting benchmark for   immunopathological dataset due to its accuracy, and training speed secondary to its small size (Table 11).
A sample k-fold training record as shown in Figure 7, shows the overfitting tendencies as the training error is minimised, but the validation error is not minimised. This could be attributed to the unsuitability of the models in the imaging domain without adaptation, which emphasises the importance of using the framework to adapt available CNN models to specific needs. In this study, a more suitable model was developed to clean up, augment, and enhance our dataset following the steps of the framework.
As shown in were normalised to develop a more suitable CNN using WRN with RPL. Significant improvements in both the cell and clinical dataset were seen, which is appended with immunopathological data. The results were further improved by including a threshold based on the patients' survival rate.
A filter of patients with lower survival rate was experimented where the dataset was split with an arbitrary cut-off of ≥12 months, ≥16 months, and ≥ 20 months. Evaluating using the same unified model, Table 13 showed the following MAE on 5-fold K-cv for each cut-off, where comparison of the combined clinical dataset and cell dataset were made after applying cut-off filter. The clinical dataset augmented with the stroma and tumour count from the cell dataset is also reflected in Table 13 for reference. The increased in cut-off threshold meant that the model had a smaller dataset. Therefore, an increased in MAE of the model was expected, which was in line with the results shown in Table 13.

Limitations
Some limitations of this study should be noted. Firstly, this study uses a small dataset, which meant that the results could be less robust and of a lower confidence level. Although, this was minimised with the use of k-fold cross validation, more advanced techniques such as semi-supervised learning could be explored to augment the dataset. Secondly, there is currently no medical evidence to support using a cut-off to segregate patients as a valid approach. The approach used in this study is solely from a DL standpoint and therefore requires more medical based research to prove its validity. Moreover, given the novelty of the proposed framework, there is currently limited literature to support its application in other medical domains.

Conclusions
The adaptation of DL technology with the use of mIHC in the analysis of complex data is in the upcoming alternative approach of analysis in the field of immunopatholgy. However, given its novelty, further studies are needed to optimised the framework to enable application in varies medical field. Nevertheless, the framework proposed in this chapter serves to provide a starting foundation for application in clinical studies.

Author contributions
Conceptualization and design, Y. Chua and J. Yeong; literature review, S. Goh and Y. Chua; writing-original draft, S. Goh and Y. Chua; intellectual input and