Details of the various clinical variables used under study along with missing values information in percentage.
Early identification of individuals with sepsis is valuable for clinical triage and decision-making, because it enables early intervention and improves outcomes. This study aims to develop an explainable machine learning model with clinical interpretability that predicts sepsis onset six hours in advance, and to validate it with improved risk prediction for every hourly time interval since admission to the ICU. The retrospective observational cohort study uses PhysioNet Challenge 2019 ICU data from three distinct hospital systems, viz. A, B, and C. Data from A and B were shared publicly for training and validation, while sequestered data from all three cohorts were used for scoring. This study is limited to the publicly available training data, which contains 1,552,210 hourly records of 40,336 ICU patients with up to 40 clinical variables (recorded for each hour of the ICU stay), divided into two datasets according to hospital systems A and B. Clinical feature exploration and interpretation for early prediction of sepsis is achieved using the proposed framework, viz. the explainable Machine Learning model for Early Prediction of Sepsis (xMLEPS). A total of 85 features, comprising the given 40 clinical variables augmented with 10 derived physiological features and 35 time-lag difference features, are fed to xMLEPS for predicting sepsis onset. A ten-fold cross-validation scheme is employed in which an optimal prediction risk threshold is searched for each of the 10 LightGBM models. Each model then uses its optimal threshold to refine its predicted labels, and hence its utility score, in the corresponding fold. The entire framework is tuned via Bayesian optimization and trained on the resulting set of 85 features, yielding an average normalized utility score of 0.4214 and an area under the receiver operating characteristic curve of 0.8591 on the publicly available training data.
This study establishes a practical and explainable sepsis onset prediction model for ICU data using an applied ML approach, mainly gradient boosting. The study highlights the clinical significance of physiological inter-relations among the given and proposed clinical signs via feature importance and SHapley Additive exPlanations (SHAP) plots for visualized interpretation.
- early prediction
- machine learning
- explainable AI
- electronic health records
- clinical informatics
- critical care
- model-based diagnosis
Sepsis is an enigmatic clinical condition that occurs when the patient’s body reacts adversely to infection and, as a consequence, develops organ dysfunction. Sepsis can affect practically all organ systems; however, the organs involved and the degree of dysfunction vary distinctly among patients, and the condition can even be fatal [1, 2]. In the early stages of the disease, sepsis is relatively easy to treat with broad-spectrum antibiotics, whereas in the later stages it becomes much easier to diagnose but extremely hard to treat. Early diagnosis of sepsis is therefore essential for better clinical management.
Current manual assessment of sepsis using screening tools, like the Sequential Organ Failure Assessment (SOFA) score for ICU patients, is complex in terms of the clinical signs that must be measured and lacks adequate sensitivity [5, 6]. On the other hand, AI and machine learning-based automated clinical decision support systems that use easily accessible clinical data have shown a significant improvement in agreement with these treatment protocols in ICUs by guiding physicians through predefined workflows [7, 8, 9, 10, 11]. The abundant availability of electronic medical records (EMRs) in the current era has made such automated realizations more feasible. However, almost every machine learning (ML)-based AI model and automated decision support system lacks proper explainability because of its uninterpretable black-box nature [13, 14]. This is where Explainable Artificial Intelligence (XAI) comes to the rescue: by adding explainability, it addresses some of the restrictions imposed by a black-box AI system. It thereby assists clinicians in interpreting a diagnosis and recommends future actions, improving the quality of predictions [15, 16, 17]. The development of such an explainable ML framework for sepsis onset prediction is an important and active area of investigation.
This work presents a novel clinical application of developing an explainable ML framework for sepsis onset prediction among ICU patients based on the physiological medical knowledge of given clinical signs, obtained via extensive analysis, and using popular gradient boosting ML techniques. The framework’s design includes an optimal explainable gradient boosting architecture for clinical decision making that investigates questions of generalizability and interpretability of the proposed system.
An overview of the proposed methodology from raw data to explainable decision framework is shown in Figure 1.
2.1 Dataset and study population
The publicly available training set consists of data from two cohorts. Cohort A has 790,215 records of 20,336 patients. Cohort B has 761,995 records of 20,000 patients. Every patient record contains 40 clinical covariates, i.e. 8 vital signs, 26 laboratory values, and 6 demographic values. The patient data were labeled according to the Sepsis-3 clinical criteria. Table 1 presents the details of the clinical covariates used in this study together with their percentage of missing values [18, 19].
|Sr. no.||Covariates||Missing values (%)||Units|
|36||Gender||—||Male (1) or Female (0)|
|37||Unit 1||39.42||true (1) or false (0)|
|38||Unit 2||39.42||true (1) or false (0)|
|41||SepsisLabel||—||septic (1) or nonseptic (0)|
2.2 Feature extraction
Feature extraction is performed on the imputed version of the given clinical data and generates features sample-wise on an hourly time grid. Two types of features were generated, namely 10 derived physiological features, listed below, and 35 time-lag difference features:
|1||SIndex||Shock Index (SIndex) is the ratio of heart rate (HR) to systolic blood pressure (SBP), multiplied by age.||(HR/SBP)*Age|
|2||DBPSIndex||Diastolic Shock Index is the ratio of HR to diastolic blood pressure (DBP), multiplied by age.||(HR/DBP)*Age|
|3||MAPSIndex||It is defined as the ratio of HR to Mean Arterial Pressure (MAP), multiplied by age.||(HR/MAP)*Age|
|4||BUNCr||It is the ratio of Blood Urea Nitrogen (BUN) to Creatinine.||BUN/Creatinine|
|5||BILTCr||It is the ratio of Total Bilirubin (Bilirubin_total) to Creatinine.||Bilirubin_total/Creatinine|
|6||SaO2 -FiO2||It is the ratio of oxygen saturation of arterial blood in percentage (SaO2) to the fraction of inspired oxygen (FiO2).||SaO2/FiO2|
|7||PaO2 -FiO2||It is defined as the ratio of the partial pressure of oxygen (PaO2) to the fraction of inspired oxygen (FiO2).||PaO2/FiO2|
|8||Pla_Age||It is the ratio of platelets to age.||Platelets/Age|
|9||PP||Pulse Pressure (PP) is the difference between SBP and DBP.||SBP-DBP|
|10||CO||Cardiac Output is the product of pulse pressure (PP) and HR.||PP * HR|
Finally, the obtained 45 features are combined with the given 40 clinical signs, thereby increasing the final feature count to 85 features. The resultant feature set is then fed to train the proposed xMLEPS framework.
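The ten derived physiological features in the table above can be sketched with pandas as follows. This is a minimal illustration, not the authors' code: the input column names (`HR`, `SBP`, `PaO2`, and so on) are assumed to match the covariate names used in the formulas, the output names `SaO2_FiO2` and `PaO2_FiO2` are chosen here for convenience, and mapping zero denominators to NaN is an added safeguard.

```python
import numpy as np
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append the ten ratio/difference/product features to a copy of df."""
    out = df.copy()

    def nonzero(col: str) -> pd.Series:
        # Assumption: zero denominators become NaN so ratios never divide by zero.
        return out[col].where(out[col] != 0)

    # Shock indices, each multiplied by age as in the table formulas.
    out["SIndex"] = out["HR"] / nonzero("SBP") * out["Age"]
    out["DBPSIndex"] = out["HR"] / nonzero("DBP") * out["Age"]
    out["MAPSIndex"] = out["HR"] / nonzero("MAP") * out["Age"]
    # Laboratory and oxygenation ratios.
    out["BUNCr"] = out["BUN"] / nonzero("Creatinine")
    out["BILTCr"] = out["Bilirubin_total"] / nonzero("Creatinine")
    out["SaO2_FiO2"] = out["SaO2"] / nonzero("FiO2")
    out["PaO2_FiO2"] = out["PaO2"] / nonzero("FiO2")
    out["Pla_Age"] = out["Platelets"] / nonzero("Age")
    # Pulse pressure and the PP * HR product.
    out["PP"] = out["SBP"] - out["DBP"]
    out["CO"] = out["PP"] * out["HR"]
    return out
```

The function leaves the original covariates untouched and only appends columns, so it can be applied to the imputed hourly table before the time-lag difference features are computed.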
2.3 Implementation of xMLEPS
Together with Bayesian optimization and refinement of the prediction risk threshold, a method called xMLEPS is developed for detecting sepsis onset six hours in advance. As shown in Figure 1, the given clinical sepsis data has a large amount of missing information (approximately 20%). Filling these missing values is therefore carried out as a pre-processing step at the start of the algorithm, using forward-fill imputation on the given EHR clinical data. In a real-time scenario, a missing value can only be filled with a previously available measurement, so only past clinical values are used to impute the current observation.
In this study, imputation is carried out in two rounds: first local imputation for each individual record, and then global imputation over all records combined. In local imputation, missing values of a clinical covariate (feature vector) are forward filled with the nearest past non-missing value within the same patient record. If a record has ‘NaN’ values at the beginning, i.e. from the first measurement at t = 0, they are initially retained as they are and later replaced with the ‘global mean’ of that covariate obtained by combining all records.
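The two imputation rounds can be sketched with pandas as shown here. This is an illustrative sketch under stated assumptions: the identifier column `patient_id` is hypothetical, rows are assumed to be sorted by hour within each patient, and the global mean is computed over the combined forward-filled records, which is one reading of the description above.

```python
import numpy as np
import pandas as pd


def impute(records: pd.DataFrame, id_col: str = "patient_id") -> pd.DataFrame:
    """Local forward fill per patient, then global-mean fill for leading NaNs."""
    out = records.copy()
    feature_cols = [c for c in out.columns if c != id_col]

    # Round 1 (local): carry the last observed value forward, patient by patient,
    # so no value ever leaks from one patient record into another.
    out[feature_cols] = out.groupby(id_col)[feature_cols].ffill()

    # Round 2 (global): NaNs that remain are leading gaps with no past value.
    # Assumption: they receive the covariate mean over all combined records.
    global_means = out[feature_cols].mean()
    out[feature_cols] = out[feature_cols].fillna(global_means)
    return out
```

Because the forward fill only looks backwards in time, the same routine could be applied hour by hour in a streaming setting, provided the global means are fixed from the training data beforehand.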
During model development, a ten-fold cross-validation scheme is employed in which 10 LightGBM classifiers, one per fold, are built with the same hyper-parameter configuration obtained from Bayesian optimization. The feature set used to develop these models comprises the 85 features described above.
The initial predictions generated by each optimal model on the validation data of its fold undergo refinement of the prediction risk threshold to enhance the utility score. The search space for the prediction risk threshold lies in the range of 0 to 1 and is varied in steps of 0.05, giving 20 candidate values. The initial predictions on the validation data of each fold are compared against each of these 20 values, and the threshold that gives the maximum utility score for the predictions of that fold is taken as optimal. The resulting 10 optimal thresholds are later used by the corresponding models to refine the generated labels, and hence the utility score, in each fold.
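The per-fold threshold search can be sketched as a simple grid search. The sketch below makes two assumptions that are not in the original text: the grid runs from 0.00 to 0.95 (which endpoint is excluded to obtain 20 values is not stated), and an F1 score is used as a stand-in metric because the challenge utility function needs per-patient onset times. In practice the `metric` argument would be replaced by the normalized utility score.

```python
import numpy as np


def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Stand-in metric; the study maximizes the challenge utility score instead."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def threshold_grid(step: float = 0.05, n_values: int = 20) -> np.ndarray:
    # 0.00, 0.05, ..., 0.95. Dropping the upper endpoint is an assumption.
    return np.round(np.arange(n_values) * step, 2)


def best_threshold(y_true, risk, metric=f1_score):
    """Return (threshold, score) maximizing the metric on one validation fold."""
    y_true = np.asarray(y_true)
    risk = np.asarray(risk)
    best_t, best_score = None, -np.inf
    for t in threshold_grid():
        score = metric(y_true, (risk >= t).astype(int))
        if score > best_score:  # strict '>' keeps the lowest threshold on ties
            best_t, best_score = float(t), score
    return best_t, best_score
```

Running `best_threshold` once per fold on that fold's validation predictions yields the ten fold-specific thresholds described above.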
This LightGBM-based gradient boosting framework provides a dedicated processing method for sparse data, which is important in our classification task with its class imbalance problem. For interpretability, the LightGBM feature importance attribute is used to quantify the contribution of each variable, and the explainability component is addressed with SHAP summary and dependency plots, which illustrate the distribution of variable importance [28, 29].
The proposed framework uses the given patient records to estimate the risk of sepsis onset in the next 6 hours. Performance is measured with the continuous-valued utility score defined by the challenge organizers for each prediction. The utility function rewards or penalizes classifiers for their predictions within 12 hours before and 3 hours after the sepsis onset time, and is normalized as described by the organizers. Using a ten-fold cross-validation scheme, 10 LightGBM models are built on patient-wise stratified folds, each containing a unique 10% of the entire training set. The hyper-parameters that minimize the cross-validation loss are obtained with the automatic hyper-parameter optimization utility ‘bayesopt’ in Python [30, 31]. The objective function formulated for the optimization maximizes the AUROC, and the utility finds the optimal parameters automatically using Bayesian optimization. The optimized models include: 60 ‘’, 120 ‘’, ‘’ of 2, ‘’ of 0.01, ‘’ of 20, ‘’ of 4.
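A patient-wise stratified split, as described above, can be sketched in plain NumPy. This is a hedged illustration rather than the authors' implementation: a patient is treated as positive if any of their hourly labels is 1, and the function name, arguments, and round-robin assignment are choices made here.

```python
import numpy as np


def patient_stratified_folds(patient_ids, labels, n_folds: int = 10, seed: int = 0):
    """Return a fold index per hourly record, keeping each patient in one fold."""
    patient_ids = np.asarray(patient_ids)
    labels = np.asarray(labels, dtype=float)

    # Map every record to its patient and flag patients with any septic hour.
    unique_ids, inverse = np.unique(patient_ids, return_inverse=True)
    positive = np.bincount(inverse, weights=labels) > 0

    rng = np.random.default_rng(seed)
    patient_fold = np.empty(len(unique_ids), dtype=int)
    for cls in (False, True):
        # Shuffle patients of one class, then deal them out round-robin so
        # every fold gets a near-equal share of septic and non-septic patients.
        idx = rng.permutation(np.flatnonzero(positive == cls))
        patient_fold[idx] = np.arange(len(idx)) % n_folds

    # Broadcast the patient-level assignment back to the hourly records.
    return patient_fold[inverse]
```

Splitting at the patient level prevents hours from the same ICU stay from appearing in both the training and validation parts of a fold, which would otherwise inflate the validation scores.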
Table 3 summarizes the results of the proposed framework on the entire training data in a ten-fold cross-validation scheme. The results also include the performance of the inter-cohort and baseline studies. To check that the trained models learn dependencies that hold not only across patient records but also across cohorts, we used an inter-cohort training and testing scheme, i.e. a model trained on the data of cohort A was scored on cohort B and vice versa. This reduces the concern of over-fitting to a single cohort and supports the robustness of the framework. The inter-cohort utility scores were 0.3191 when training on A and testing on B, and 0.3284 when training on B and testing on A.
|Average (Std)||0.8591 (0.0085)||0.1502 (0.0286)||0.4214 (0.0148)||—|
|xMLEPS||Set A (Training) and Set B (Test)||0.3191|
|xMLEPS||Set B (Training) and Set A (Test)||0.3284|
3.1 Comparison of xMLEPS with baseline
Further, to emphasize the clinical relevance of the derived features under this proposed method, a comparative analysis of results is done by carrying out three baseline studies as shown in Figure 2.
As part of the comparative analysis, three well-tuned baseline studies are performed. In the first, the proposed method with the feature set of 85 features is tested without optimal threshold refinement (the default threshold value of 0.5 is used) in a 10-fold cross-validation scheme. In the second and third, only the given 40 clinical variables are fed directly to the LightGBM models, with and without optimal threshold refinement respectively, again in a 10-fold cross-validation scheme. Table 3 presents the results of these three baseline studies. As expected, the proposed method xMLEPS outperforms all three. The third study, carried out without derived features and without optimal threshold refinement, shows the worst performance. Even for the first baseline study, the utility score is lower by 3% compared with the proposed method.
3.2 Explanation and visualization of feature importance
The cumulative feature importance of the top 50 features is shown in Figure 3. Here the LightGBM feature importance attribute is used for the gradient boosting framework developed. One approach is to count the number of times a feature is used to split the data across all trees. The weakness of this approach is that it does not account for the different impacts of different splits. The next best approach is to attribute to each feature the gain, i.e. the reduction in average training loss, achieved when that feature is used for splitting. This “Gain” measure of feature importance recovers the correct mutual information between feature inputs and label outputs. Its limitation is that it is easily biased when greedy trees are built in finite ensembles, so other methods have been designed to compensate for the bias in feature selection with the gain approach [33, 34].
The SHAP summary plot with the 20 clinical features that the xMLEPS framework identifies as most important for sepsis onset is shown in Figure 4(a). Here the features are sorted in decreasing order of mean relevance, where the local relevance scores are averaged over the population but only over those individuals who were positive for sepsis. The mean relevance is displayed as blue horizontal bars in Figure 4(a). The summary of local explanations is shown in Figure 4(b), in which all individual data points are positioned by their relevance for sepsis and colored by feature value. From Figure 4(b) we can infer that an increase in the length of stay (ICULOS) and higher values of clinical ratios like PaO2/FiO2 and the shock indices DBPSIndex and SIndex are associated with the development of sepsis, as are lower Platelets, DBP and Magnesium levels. These findings are consistent with previous studies [7, 21, 35, 36].
Further, the impact of each feature and the interactions among features in sepsis development can be illustrated using SHAP dependency plots. As an example, Figure 4(c) shows the dependency plot for the interaction of heart rate with ICULOS. The xMLEPS model appears to associate high heart rate values in the range 120–180, together with increased ICULOS, with a higher risk of sepsis. Figure 4(d) shows that lower values of SBP (approximately below 90) with increased ICULOS are likewise associated with sepsis. A summary plot of the SHAP interaction value matrix is shown in Figure 5, in which the diagonal reflects the main effects and the off-diagonal entries show the interaction effects. The explainable model produces a high probability when it is confident about a decision, resulting in larger relevance scores because more relevance is available to distribute backward to the features. Conversely, the model outputs a lower probability when it is less confident that the patient will develop sepsis, and as a result yields lower relevance scores. This summary of the score distribution gives clinicians an indication of what to expect from the model in clinical practice.
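The link between the model output and the size of the relevance scores follows from the standard additivity property of SHAP values, stated here for reference. The notation is an assumption introduced for this note: $f(x)$ is the model output for a patient-hour $x$ (for tree ensembles usually on the log-odds scale), $\phi_0$ is the base value (the expected output over the background data), $\phi_i$ is the SHAP value of feature $i$ among the $M = 85$ features, and $\phi_{i,j}$ are the SHAP interaction values.

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x),
\qquad
\phi_i(x) = \sum_{j=1}^{M} \phi_{i,j}(x)
```

The first identity shows why a confident prediction, far from the base value, must come with larger total relevance. The second shows that each row of the interaction matrix sums to the feature's SHAP value, with $\phi_{i,i}$ as the main effect on the diagonal and $\phi_{i,j}$, $i \neq j$, as the interaction effects.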
This study justifies the clinical significance of the derived physiological inter-relations among the clinical signs via feature importance and SHAP plots for visualized interpretation. Though SHAP values cannot be used as a generalized approach for early prediction of sepsis, they help in generating relevant clinical hypotheses for the events of interest. The SHAP illustrations help to mitigate the black-box concerns associated with prediction models and might give clinicians a better understanding of the important features of the xMLEPS framework. The proposed framework can establish the significance of the individual features that contribute to an improved utility score, thus ensuring the interpretability of the framework for its clinical users. Furthermore, because the proposed prediction framework relies on clinical ICU data from routine care rather than on advanced molecular biomarkers, it can potentially be integrated into a computerized clinical decision support system.
The recent research literature relevant to early diagnosis of sepsis comes from the articles of the various submission entries to the PhysioNet Challenge 2019. This challenge aimed at the design and development of algorithms for early and automated prediction of sepsis onset, with an optimal prediction window of six hours before the actual clinical recognition of disease onset. The predictions of the machine learning algorithms were rewarded if they detected true positives up to 12 hours before disease onset and were slightly penalized for false positives. However, predictions were strongly penalized if they were incorrect near disease onset. The six-hour prediction window is motivated by the clinical observation that the median time to antimicrobial therapy is 6 hours. Furthermore, each hour of delay in treatment reduces the survival rate by 7.6% on average.
Table 4 compares the results of the proposed method with our previous works [19, 38] and with the submission approaches [39, 40, 41, 42, 43, 44, 45, 46] that reported the best results in the PhysioNet 2019 Challenge. Most of these approaches used a 5- or 10-fold cross-validation scheme and yielded utility scores in the range of 0.36–0.45.
|Chang et al.||Temporal Convolutional Networks (TCN)||—||0.4170|
|Li et al.||A Time-phased model||—||0.4300|
|Morrill et al.||A signature transform-based model||—||0.4340|
|Zabihi et al.||XGBoost Ensemble models||0.8333||0.4280|
|Yang et al.||Fusion-based XGBoost||0.8400||0.4300|
|Du et al.||Gradient Boosting Scheme||0.8630||0.4090|
|Lee et al.||Graph Convolutional Networks (GCN)||0.8170||0.3820|
|Lyra et al.||Using Random forest classifier||0.8100||0.3760|
|Nesaragi and Patidar||Ratio and Power-based features||0.8432||0.4013|
|Nesaragi et al.||PMI-based Tensor factorization||0.8621||0.4519|
This study supports the use of the utility score as an effective metric for sepsis onset on ICU data. However, experiments showed that the F1 score also gave reliable results in line with the utility score, i.e. the F1 score rises and falls together with the utility score. The two differ in their bounds: the utility score ranges from −2 to 1, whereas the F1 score ranges from 0 to 1. The other conventional metrics, namely AUROC, AUPRC, and accuracy, are not informative on such a highly imbalanced dataset and can be misleading for sepsis onset. Further, as mentioned by Roussel et al., interpreting these results together with the utility score is quite difficult.
The limitation of this study is that it is restricted to a two-center cohort design based on the available training data, which raises the concern that the trained models may over-fit to the particular cohort data and its patient records. However, the analyzed ICU admissions originate from a diverse population covering the entire spectrum of ICU patients, and the inter-cohort train-test validation, together with the optimal threshold refinement, supports the deployability of our framework in other ICUs.
This study presents xMLEPS, an explainable machine learning framework for the early prediction of sepsis using clinical data in the ICU setting. Its predictive explanations justify the clinical significance of physiological inter-relations among the given clinical signs via visualized interpretation. They thereby assist clinicians in diagnostic decision making and recommend future actions to improve the quality of predictions. This suggests that data-driven automated ML models have the potential to shift the paradigm from conventional detection and treatment to automated early prediction that prevents organ system failure due to sepsis.
Conflict of interest
The authors declare no conflict of interest.