An Explainable Machine Learning Model for Early Prediction of Sepsis Using ICU Data

Early identification of individuals with sepsis is very useful in assisting clinical triage and decision-making, resulting in early intervention and improved outcomes. This study aims to develop an explainable machine learning model with clinical interpretability that predicts sepsis six hours before onset, and to validate it with improved prediction risk power for every time interval since admission to the ICU. The retrospective observational cohort study is carried out using PhysioNet Challenge 2019 ICU data from three distinct hospital systems, viz. A, B, and C. Data from A and B were shared publicly for training and validation, while sequestered data from all three cohorts were used for scoring. However, this study is limited to the publicly available training data. The training data contain 1,552,210 patient records of 40,336 ICU patients with up to 40 clinical variables (sourced for each hour of their ICU stay), divided into two datasets based on hospital systems A and B. The clinical feature exploration and interpretation for early prediction of sepsis is achieved using the proposed framework, viz. the explainable Machine Learning model for Early Prediction of Sepsis (xMLEPS). A total of 85 features, comprising the given 40 clinical variables augmented with 10 derived physiological features and 35 time-lag difference features, are fed to xMLEPS for the prediction of sepsis onset. A ten-fold cross-validation scheme is employed wherein an optimal prediction risk threshold is searched for each of the 10 LightGBM models. These optimum threshold values are later used by the corresponding models to refine the predictive power in terms of utility score for the prediction of labels in each fold. The entire framework is designed via Bayesian optimization and trained with the resultant feature set of 85 features, yielding an average normalized utility score of 0.4214 and an area under the receiver operating characteristic curve of 0.8591 on the publicly available training data.
This study establishes a practical and explainable sepsis onset prediction model for ICU data using an applied ML approach, mainly gradient boosting. The study highlights the clinical significance of physiological inter-relations among the given and proposed clinical signs via feature importance and SHapley Additive exPlanations (SHAP) plots for visualized interpretation.


Introduction
Sepsis is an enigmatic clinical condition that occurs when the patient's body reacts adversely to infection and as a consequence develops organ dysfunction.
Sepsis can affect practically all organ systems; however, the organs involved and the degree of dysfunction vary distinctly among patients, and the condition can even lead to death in many cases [1,2]. In the early stages of the disease, the treatment of sepsis seems to be relatively easy with the availability of broad-spectrum antibiotics [3]. In the later stages, sepsis becomes much easier to diagnose but extremely hard to treat. Therefore, early diagnosis of sepsis is the need of the hour for better clinical management [4].
Current manual assessment of sepsis using screening tools, like the Sequential Organ Failure Assessment (SOFA) score for ICU patients, is complex in terms of measured clinical signs and even lacks adequate sensitivity [5,6]. On the other hand, AI and machine learning-based automated clinical decision support systems that use easily accessible clinical data have shown a significant improvement in agreement with these treatment protocols in ICUs by guiding physicians through predefined workflows [7][8][9][10][11]. The abundant availability of electronic medical records (EMRs) in the current era has made such automated realizations more feasible [12]. However, almost every machine learning (ML)-based AI model and automated decision support system lacks proper explainability because of its uninterpretable black-box nature [13,14]. This is where Explainable Artificial Intelligence (XAI) comes to the rescue: it addresses some of the restrictions imposed by a black-box AI system by adding explainability, and thus assists clinicians in interpreting their diagnosis and recommends future actions to be taken, thereby improving the quality of predictions [15][16][17]. The development of such an explainable ML framework for sepsis onset prediction is an important and active area of investigation.
This work presents a novel clinical application of developing an explainable ML framework for sepsis onset prediction among ICU patients based on the physiological medical knowledge of given clinical signs, obtained via extensive analysis, and using popular gradient boosting ML techniques. The framework's design includes an optimal explainable gradient boosting architecture for clinical decision making that investigates questions of generalizability and interpretability of the proposed system.

Methods
An overview of the proposed methodology from raw data to explainable decision framework is shown in Figure 1.

Dataset and study population
The publicly available training set consists of data from two cohorts [18]. Cohort A has 790,215 records of 20,336 patients. Cohort B has 761,995 records of 20,000 patients. Every patient record contains 40 clinical covariates, i.e. 8 vital signs, 26 laboratory values, and 6 demographic values. The patient data were labeled according to the Sepsis-3 clinical criteria. Table 1 presents the details of the clinical covariates used in this study together with their percentage of missing values [18,19].

Feature extraction
Feature extraction takes place on the imputed version of the given clinical data and generates features sample-wise on an hourly time grid. Two types of features were generated, namely derived physiological features and time-lag difference features [7,20]. Based on a review of studies that justify the clinical significance of well-established physiological inter-relations among the given clinical signs, 10 such physiological relations are derived from the given covariates. The first three are shock indices: the well-defined Shock Index (SIndex) using systolic BP, and two modified versions proposed in this study using diastolic BP (DBPSIndex) [21] and mean arterial pressure (MAPSIndex) [22]. These are followed by the ratios BUN/Creatinine (BUNCr) [7], Bilirubin total/Creatinine (BILTcr), SaO2/FiO2 [23], PaO2/FiO2 [24], and Platelets/Age (PlaAge); the difference between SBP and DBP, called Pulse Pressure (PP) [25]; and lastly Cardiac Output (CO) [26] (Table 2). Finally, the obtained 45 features (10 derived physiological features and 35 time-lag difference features) are combined with the given 40 clinical signs, increasing the final feature count to 85. The resultant feature set is then used to train the proposed xMLEPS framework.
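The two feature types can be illustrated with a minimal pandas sketch. It is not the authors' implementation: the column names (HR, SBP, DBP, MAP, BUN, Creatinine) are assumed to follow the PhysioNet 2019 naming, the modified shock indices are assumed to be heart rate divided by the respective pressure, and the lag of one hour is an assumption because the text does not state which lags were used.

```python
import numpy as np
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append a few ratio/difference features to one patient's hourly table."""
    out = df.copy()
    # Shock indices: heart rate divided by a blood-pressure reading (assumed form).
    out["SIndex"] = out["HR"] / out["SBP"]
    out["DBPSIndex"] = out["HR"] / out["DBP"]
    out["MAPSIndex"] = out["HR"] / out["MAP"]
    # Laboratory ratio and pulse pressure.
    out["BUNCr"] = out["BUN"] / out["Creatinine"]
    out["PP"] = out["SBP"] - out["DBP"]
    # A zero denominator would give +/-inf; treat those values as missing.
    return out.replace([np.inf, -np.inf], np.nan)


def add_lag_differences(df: pd.DataFrame, columns, lag: int = 1) -> pd.DataFrame:
    """Append the change of each column relative to `lag` hours earlier."""
    out = df.copy()
    for col in columns:
        out[f"{col}_diff{lag}"] = out[col].diff(periods=lag)
    return out


# Hypothetical three-hour record of a single patient.
patient = pd.DataFrame({
    "HR": [80.0, 90.0, 120.0],
    "SBP": [120.0, 100.0, 80.0],
    "DBP": [80.0, 60.0, 50.0],
    "MAP": [93.0, 73.0, 60.0],
    "BUN": [14.0, 20.0, 30.0],
    "Creatinine": [1.0, 1.0, 1.5],
})
features = add_lag_differences(add_derived_features(patient), ["HR", "SBP"])
```

In this toy record the shock index rises from about 0.67 to 1.5 over three hours while the pulse pressure narrows, which is the kind of trend the derived and time-lag features are meant to expose to the classifier.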

Implementation of xMLEPS
Combining Bayesian optimization with refinement of the prediction risk threshold, an optimal method called xMLEPS is developed for detecting sepsis six hours before onset. As shown in Figure 1, the given clinical sepsis data have a large amount of missing information (approximately 20%). So, at the start of the computation, these missing values are filled in as a preprocessing step. The imputation is done by forward filling the given EHR clinical data. In the real-time scenario, the current missing values encountered are to be filled with previous values. In this study, imputation is carried out in two rounds: first local imputation for each individual record, and then global imputation for all the records combined. In local imputation, the trailing missing values in a row for a particular clinical covariate (or feature vector) are forward filled with the nearest past non-missing value in that row, locally for the given record. If the record has 'NaN' values at the beginning, i.e. at the first measurement at t = 0, they are initially retained as they are and later replaced with the 'global mean' for that covariate row, obtained by combining all records [19].
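The two imputation rounds can be sketched as follows. This is an illustrative reading of the description, not the original code: the `patient_id` column name is hypothetical, and the global mean is assumed to be computed after the local forward fill, which the text does not specify.

```python
import numpy as np
import pandas as pd


def impute(records: pd.DataFrame, patient_col: str = "patient_id") -> pd.DataFrame:
    """Local forward fill per patient, then global mean for what is still missing."""
    out = records.copy()
    value_cols = [c for c in out.columns if c != patient_col]
    # Local round: forward fill inside each patient's record only,
    # so values never leak from one patient to the next.
    out[value_cols] = out.groupby(patient_col)[value_cols].ffill()
    # Global round: leading NaNs that had no past value take the
    # column mean computed over all records combined.
    out[value_cols] = out[value_cols].fillna(out[value_cols].mean())
    return out


# Hypothetical hourly heart-rate readings for two patients.
records = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "HR": [np.nan, 80.0, np.nan, np.nan, 100.0, np.nan],
})
imputed = impute(records)
```

Here the trailing gaps are filled with 80 and 100 respectively, while both leading gaps receive the combined mean of 90 rather than a value borrowed from the other patient.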
During model development, a ten-fold cross-validation scheme is employed in which 10 LightGBM classifiers, one per fold, are built with the same model hyper-parameters obtained during Bayesian optimization. The total feature set used to develop these models comprises the 85 features described in sub-Section 2.2. Generally, hyper-parameter optimization looks for the hyper-parameter values that minimize the objective loss function. The hyper-parameter settings that maximize the custom-defined challenge metric, the utility score, on the subset of training data during the Bayesian optimization phase are later used to build the models. These models generate predictions on the held-out 10% of validation data in each fold. Training in each fold stops when the utility score on the validation set shows no further improvement over 32 consecutive iterations, i.e. early stopping resets the model to its best iteration and thereby avoids over-fitting.
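The stopping rule described above amounts to a patience counter on the validation score. The following sketch shows the rule in isolation on a list of per-iteration scores; the function name and the toy scores are hypothetical, and in practice a library callback such as LightGBM's `early_stopping(stopping_rounds=32)` would normally take this role.

```python
def best_iteration(validation_scores, patience=32):
    """Return (index, score) of the best iteration under a patience rule."""
    best_idx, best_score = -1, float("-inf")
    for idx, score in enumerate(validation_scores):
        if score > best_score:
            best_idx, best_score = idx, score
        elif idx - best_idx >= patience:
            # No improvement for `patience` consecutive iterations: stop and
            # keep the model as it was at the best iteration.
            break
    return best_idx, best_score


# Hypothetical utility scores: a peak at iteration 2, then a long plateau.
scores = [0.10, 0.20, 0.30] + [0.25] * 40 + [0.90]
stop_idx, stop_score = best_iteration(scores, patience=32)
```

With a patience of 32 the loop stops during the plateau and returns iteration 2, so the late spike at the end of the list is never reached; that is the trade-off a patience rule makes to limit over-fitting and training time.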
The initial predictions generated by each optimal model on the corresponding validation data of each fold undergo refinement of the prediction risk threshold to enhance the utility score. The search space for the prediction risk threshold lies in the range of 0 to 1 and is varied in steps of 0.05, so it contains 20 values. The initial predictions on the validation data of each fold are compared with each of these 20 values. The threshold value that gives the maximum utility score for the set of predictions of that fold is taken as optimal. These 10 optimum threshold values are later used by the corresponding models to refine the predictive power in terms of utility score for the generated labels in each fold.
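The threshold refinement is a one-dimensional grid search, sketched below. Two assumptions are made: the grid is taken as 0.00, 0.05, ..., 0.95 so that it has the 20 values mentioned in the text, and the F1 score stands in for the challenge utility score, whose full definition is given in [18] and is not reproduced here.

```python
import numpy as np


def f1_score(y_true, y_pred):
    """F1 score from binary arrays; stand-in for the challenge utility score."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_threshold(y_true, risk, score_fn=f1_score, step=0.05, n_steps=20):
    """Return the (threshold, score) pair that maximizes score_fn on one fold."""
    thresholds = np.arange(n_steps) * step  # 0.00, 0.05, ..., 0.95
    scores = [score_fn(y_true, (risk >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])


# Hypothetical validation labels and predicted risks for one fold.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
risk = np.array([0.02, 0.11, 0.22, 0.31, 0.37, 0.42, 0.61, 0.83])

best_t, best_score = search_threshold(y_true, risk)
default_score = f1_score(y_true, (risk >= 0.5).astype(int))
```

On this toy fold the search selects a threshold of 0.35, which scores higher than the default of 0.5 because it recovers one additional positive at the cost of a single false alarm.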
This LightGBM-based gradient boosting framework provides a specific processing method for sparse data, which is important in our classification task with its class imbalance problem [27]. For interpretability, the LightGBM feature importance attribute is used to quantify each variable, and the explainability component is addressed with SHAP summary and dependency plots, which illustrate the distribution of the variable importance [28,29].

Results
The proposed framework performs prediction on the given patient records to determine the risk of sepsis onset in the next 6 hours. This is achieved using a continuous-valued utility score as defined by the challenge organizers for each prediction [18]. The utility function rewards or penalizes classifiers for their predictions within 12 hours before and 3 hours after sepsis onset time and was normalized as described in [18]. Using a ten-fold cross-validation scheme, 10 LightGBM models are designed based on patient-wise stratified folds, each containing a unique 10% of the entire training set. The hyper-parameters of these models that minimize cross-validation loss are obtained using the automatic hyper-parameter optimization utility 'bayesopt' in Python [30,31]. The underlying objective function formulated for the optimization is intended to maximize the AUROC. The given software utility finds optimal parameters automatically using Bayesian optimization. The optimized models use 'num_leaves' of 60, 'min_data_in_leaf' of 120, 'max_depth' of 2, 'learning_rate' of 0.01, 'scale_pos_weight' of 20, and 'min_samples_split' of 4. Table 3 gives a summary of the results of the proposed framework on the entire training data in a ten-fold cross-validation scheme. The results also include the performance of the inter-cohort and baseline studies. To ensure that the models trained in the proposed study learn dependencies not only between the patient records but also among the cohorts, we considered an inter-cohort training and testing scheme, i.e. a model trained with the data of cohort A was scored on cohort B data and vice versa. This reduces the concern about over-fitting and thus increases the robustness of the framework. Inter-cohort scores for A and B were 0.3191 and 0.3284 respectively.
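The patient-wise stratified ten-fold split mentioned above can be sketched with NumPy alone. This is one possible construction under stated assumptions: every patient carries a single label (whether sepsis ever occurs during the stay), patients are dealt round-robin into folds within each label group, and the function name is hypothetical.

```python
import numpy as np


def patient_stratified_folds(patient_ids, patient_labels, n_folds=10, seed=0):
    """Map each patient id to a fold index, balancing labels across folds."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(patient_ids)
    labels = np.asarray(patient_labels)
    fold_of = {}
    for label in np.unique(labels):
        members = ids[labels == label]  # boolean indexing returns a copy
        rng.shuffle(members)
        # Deal the shuffled patients of this label into folds in turn.
        for position, pid in enumerate(members):
            fold_of[pid.item()] = position % n_folds
    return fold_of


# Hypothetical cohort: 100 patients, the first 10 of whom develop sepsis.
patient_ids = list(range(100))
patient_labels = [1] * 10 + [0] * 90
fold_of = patient_stratified_folds(patient_ids, patient_labels)

fold_sizes = [sum(1 for f in fold_of.values() if f == k) for k in range(10)]
septic_per_fold = [
    sum(1 for pid in range(10) if fold_of[pid] == k) for k in range(10)
]
```

Because folds are assigned per patient rather than per hourly row, all records of one patient end up in the same fold, which prevents hours from the same ICU stay from appearing in both the training and the validation part.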

Comparison of xMLEPS with baseline
Further, to emphasize the clinical relevance of the derived features under this proposed method, a comparative analysis of results is done by carrying out three baseline studies as shown in Figure 2.
As part of the comparative analysis, three well-tuned baseline studies are performed. Firstly, the proposed method with the feature set of 85 features is tested without optimal threshold refinement (the default no-skill threshold value of 0.5 is used) in a 10-fold cross-validation scheme. In the second and third methods, only the given 40 clinical variables are fed directly to LightGBM models, with and without refinement of the optimal threshold respectively, in a 10-fold cross-validation scheme. Table 3 presents the results of these three baseline studies. As expected, the proposed method xMLEPS outperforms all three. The third study, carried out without derived features and without optimal threshold refinement, shows the worst performance. Even for the first baseline study, the results are significantly lower, by 3% in terms of the utility score, compared to the proposed method.

Explanation and visualization of feature importance
The cumulative feature importance of the top 50 features is shown in Figure 3. Here the LightGBM feature importance attribute is used for the gradient boosting framework developed. The approach used is to count the number of times a feature is used to split the data across all trees. The weakness of this approach is that it does not account for the different impacts of different splits. The next best approach is to attribute to a feature the gain, i.e. the reduction in average training loss, achieved when that feature is used for splitting. This "Gain" measure of feature importance recovers the correct mutual information between feature inputs and label outputs [32]. Its limitation is that it is easily biased when greedy trees are built in finite ensembles. So other methods are designed to compensate for the bias in feature selection using the gain approach [33,34].
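The difference between the split-count and gain measures can be seen on a hand-made list of splits. The sketch below is illustrative only: the feature names and gain values are invented, and the helper is not part of LightGBM (whose `Booster.feature_importance` exposes the same two options through `importance_type="split"` or `"gain"`).

```python
from collections import defaultdict


def importances(splits):
    """Return (split_count, total_gain) per feature from (feature, gain) pairs."""
    count = defaultdict(int)
    gain = defaultdict(float)
    for feature, split_gain in splits:
        count[feature] += 1          # how often the feature was used to split
        gain[feature] += split_gain  # how much loss reduction those splits gave
    return dict(count), dict(gain)


# Hypothetical splits collected over all trees of an ensemble.
splits = [("ICULOS", 40.0), ("HR", 1.0), ("HR", 0.5), ("HR", 0.5), ("SBP", 6.0)]
split_count, total_gain = importances(splits)

top_by_count = max(split_count, key=split_count.get)
top_by_gain = max(total_gain, key=total_gain.get)
```

In this example HR ranks first by split count because it is used three times, although its splits contribute little, whereas the single ICULOS split dominates by gain; the two rankings can therefore disagree on the same model.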
The SHAP summary plot with the 20 most important clinical features for sepsis onset identified by the xMLEPS framework is shown in Figure 4(a). Here the features are ranked in decreasing order of mean relevance, computed from the local explanations across the population but considering only those individuals who were positive for sepsis. The mean relevance is displayed as blue horizontal bars in Figure 4(a). The summary of local explanations is shown in Figure 4(b), wherein all the individual data points are displaced by their relevance for sepsis and are colored by feature values. From Figure 4(b) we can infer that an increase in the length of stay (ICULOS) and higher values of clinical ratios like PaO2/FiO2 and the shock indices DBPSIndex and SIndex lead to a higher predicted risk of sepsis, as do lower Platelets, DBP and Magnesium levels. These findings are consistent with previous studies [7,21,35,36].
Further, the impact of each feature and the interactions among features for sepsis development can also be illustrated using SHAP dependency plots. As an example, Figure 4(c) shows the dependency plot for the interaction of heart rate with ICULOS. As seen, the xMLEPS model seems to associate high heart rate values in the range 120-180 with increased ICULOS and hence with sepsis. Further, Figure 4(d) shows that lower values of SBP (approximately below 90) are associated with increased ICULOS and hence with sepsis. A summary plot of a SHAP interaction value matrix is shown in Figure 5, wherein the diagonal reflects the main effects while the off-diagonal entries show interaction effects. The explainable model will produce a high probability when it is confident about a decision, resulting in larger relevance scores due to the availability of more relevance for distributing backward. On the contrary, the model will output a lower probability when it is less confident that the patient will develop sepsis and, as a result, yields lower relevance scores. This summary of the score distribution gives clinicians hints about what to expect from the designed model in clinical practice.
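The relevance scores discussed above are Shapley values. As a minimal illustration of what they represent, the sketch below computes exact Shapley values by enumerating feature subsets for a toy risk function. It is not the SHAP library and not the tree-based algorithm used with LightGBM: the toy model, its coefficients, the single reference patient used as baseline, and all input values are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the patient's values, the rest the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Weighted marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    """Toy risk score with a heart rate x length-of-stay interaction."""
    hr, iculos, sbp = features
    return 0.01 * hr + 0.002 * hr * iculos - 0.005 * sbp


x = [130.0, 40.0, 85.0]          # hypothetical patient: HR, ICULOS, SBP
baseline = [80.0, 10.0, 120.0]   # hypothetical reference patient
phi = shapley_values(toy_model, x, baseline)
```

The three values sum to the difference between the patient's score and the baseline score, and the interaction term is shared between heart rate and ICULOS; this additivity is what allows the summary and dependency plots to decompose a single prediction into per-feature contributions.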

Discussion
This study justifies the clinical significance of the derived physiological inter-relations among the clinical signs via feature importance and SHAP plots for visualized interpretation. Though SHAP values cannot be used as a generalized approach for early prediction of sepsis, they certainly help in generating relevant clinical hypotheses for desired events. The SHAP illustrations help mitigate the concerns about the black-box nature of prediction models and might give clinicians a better understanding of the important features of the xMLEPS framework. The proposed framework is able to establish the significance of the individual features that contribute to improving the utility score of the predictions, thus ensuring interpretability of the framework for its clinical users. Furthermore, the proposed prediction framework, which uses clinical ICU data from routine care, can potentially be integrated into a computerized clinical decision support system instead of employing advanced molecular biomarkers.
The recent research literature relevant to early diagnosis of sepsis comes from the articles of various submission entries to the PhysioNet Challenge 2019 [18]. This challenge aimed at the design and development of algorithms for early and automated prediction of sepsis onset with the optimal window definition of six hours before the actual clinical recognition of disease onset. The predictions of the machine learning algorithms were rewarded if they were able to detect true positives correctly up to 12 hours before disease onset and were slightly penalized if they were false positives. However, predictions were strongly penalized if they were incorrect near disease onset. The reason for choosing an optimal prediction window of six hours comes from the clinical fact that the observed median time to antimicrobial therapy is found to be 6 hours [37]. Furthermore, each hour of delay in treatment results in an average decrease in survival rate of 7.6% [37].
The comparative analysis of the results obtained by the proposed method with our previous works [19,38] and submission approaches [39][40][41][42][43][44][45][46] that reported the best results in the PhysioNet 2019 Challenge [18] is listed in Table 4. Most of these approaches utilized a 5- or 10-fold cross-validation scheme and yielded utility scores in the range of 0.36-0.45.
This study supports the usage of the utility score as an effective metric on ICU data for sepsis onset. However, experiments showed that the F1 score also gave reliable results that align with the utility score, i.e. the F1 score increases and decreases along with the utility score. Note that the utility score is bounded from -2 to 1, whereas the F1 score is bounded from 0 to 1. The other conventional metrics, namely AUROC, AUPRC, and accuracy, are of little use with such a highly unbalanced dataset and are misleading for sepsis onset. Further, the fact that the interpretation of these results together with the utility score is quite difficult cannot be ignored, as mentioned by Roussel et al. [47].
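Why accuracy can mislead on such unbalanced data while F1 does not is easy to show with a small worked example. The numbers below are invented for illustration: a hypothetical set of 1,000 hourly labels with 2% positives, one classifier that never raises an alarm, and one that finds half of the positives at the cost of 30 false alarms.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for two equally long lists of 0/1 labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels: 20 septic hours followed by 980 non-septic hours.
y_true = [1] * 20 + [0] * 980

# Classifier 1 never predicts sepsis.
never_alarm = [0] * 1000
# Classifier 2 detects 10 of the 20 positives and raises 30 false alarms.
some_alarms = [1] * 10 + [0] * 10 + [1] * 30 + [0] * 950

acc_never, f1_never = accuracy_and_f1(y_true, never_alarm)
acc_some, f1_some = accuracy_and_f1(y_true, some_alarms)
```

The classifier that never predicts sepsis reaches 98% accuracy with an F1 of zero, while the second classifier has lower accuracy (96%) but a clearly higher F1 of one third, so accuracy ranks the clinically useless model first.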
The limitation of this study is that it is constrained to a two-center cohort design from the available training data, which might raise the doubt that the trained models over-fit to the particular cohort data and its patient records. However, the analyzed ICU patient admissions originate from a diverse population covering the entire spectrum of ICU patients, and the validation through the inter-cohort train-test approach, along with optimum threshold refinement, supports the deployability of our framework in other ICUs.

Conclusion
This study presents xMLEPS, an explainable machine learning framework for the early prediction of sepsis using clinical data in the ICU setting. These predictive explanations justify the clinical significance of physiological inter-relations among the given clinical signs via visualized interpretation. And thus assist the clinicians in