Open access peer-reviewed chapter - ONLINE FIRST

An Explainable Machine Learning Model for Early Prediction of Sepsis Using ICU Data

By Naimahmed Nesaragi and Shivnarayan Patidar

Submitted: February 23rd 2021; Reviewed: June 17th 2021; Published: July 23rd 2021

DOI: 10.5772/intechopen.98957


Early identification of individuals with sepsis is very useful in assisting clinical triage and decision-making, resulting in early intervention and improved outcomes. This study aims to develop an explainable machine learning model with clinical interpretability that predicts sepsis onset 6 hours in advance, and to validate it with improved predictive power for every time interval since admission to the ICU. The retrospective observational cohort study is carried out using PhysioNet Challenge 2019 ICU data from three distinct hospital systems, viz. A, B, and C. Data from A and B were shared publicly for training and validation, while sequestered data from all three cohorts were used for scoring. This study, however, is limited to the publicly available training data. The training data contain 1,552,210 patient records of 40,336 ICU patients with up to 40 clinical variables (sourced for each hour of their ICU stay), divided into two datasets based on hospital systems A and B. Clinical feature exploration and interpretation for early prediction of sepsis are achieved using the proposed framework, viz. the explainable Machine Learning model for Early Prediction of Sepsis (xMLEPS). A total of 85 features, comprising the given 40 clinical variables augmented with 10 derived physiological features and 35 time-lag difference features, are fed to xMLEPS for the prediction of sepsis onset. A ten-fold cross-validation scheme is employed wherein an optimal prediction risk threshold is searched for each of the 10 LightGBM models. These optimum threshold values are later used by the corresponding models to refine the predictive power in terms of utility score for the prediction of labels in each fold. The entire framework is designed via Bayesian optimization and trained with the resultant feature set of 85 features, yielding an average normalized utility score of 0.4214 and an area under the receiver operating characteristic curve of 0.8591 on the publicly available training data.
This study establishes a practical and explainable sepsis onset prediction model for ICU data using an applied ML approach, mainly gradient boosting. The study highlights the clinical significance of physiological inter-relations among the given and proposed clinical signs via feature importance and SHapley Additive exPlanations (SHAP) plots for visualized interpretation.


  • sepsis
  • early prediction
  • machine learning
  • explainable AI
  • electronic health records
  • clinical informatics
  • critical care
  • model-based diagnosis

1. Introduction

Sepsis is an enigmatic clinical condition that occurs when the patient’s body reacts adversely to infection and, as a consequence, develops organ dysfunction. Sepsis can affect practically all organ systems; however, the organs involved and the degree of dysfunction vary distinctly among patients, and the condition can even lead to death in many cases [1, 2]. In the early stages of the disease, the treatment of sepsis seems to be relatively easy with the availability of broad-spectrum antibiotics [3]. In the later stages, by contrast, sepsis becomes much easier to diagnose but extremely hard to treat. Therefore, early diagnosis of sepsis is essential for better clinical management [4].

Current manual assessment of sepsis using screening tools, such as the Sequential Organ Failure Assessment (SOFA) score for ICU patients, is complex in terms of the clinical signs to be measured and lacks adequate sensitivity [5, 6]. On the other hand, AI and machine learning-based automated clinical decision support systems that use easily accessible clinical data have shown a significant improvement in agreement with these treatment protocols in ICUs by guiding physicians through predefined workflows [7, 8, 9, 10, 11]. The abundant availability of electronic medical records (EMRs) in the current era has made such automated systems more feasible [12]. However, almost every machine learning (ML)-based AI model and automated decision support system lacks proper explainability because of its uninterpretable black-box nature [13, 14]. This is where Explainable Artificial Intelligence (XAI) comes to the rescue: by adding explainability, it addresses some of the restrictions imposed by a black-box AI system and thus assists clinicians in interpreting their diagnoses and recommends future actions, thereby improving the quality of predictions [15, 16, 17]. The development of such an explainable ML framework for sepsis onset prediction is an important and active area of investigation.

This work presents a novel clinical application of developing an explainable ML framework for sepsis onset prediction among ICU patients based on the physiological medical knowledge of given clinical signs, obtained via extensive analysis, and using popular gradient boosting ML techniques. The framework’s design includes an optimal explainable gradient boosting architecture for clinical decision making that investigates questions of generalizability and interpretability of the proposed system.


2. Methods

An overview of the proposed methodology from raw data to explainable decision framework is shown in Figure 1.

Figure 1.

Graphical overview: from the given raw clinical data to the explainable decision framework. (a & b) Clinical data from two ICU cohorts are imputed. (c) Physiological inter-relations and time-lag differences are computed as features. (d) An optimal sepsis onset prediction architecture is developed using LightGBM models via Bayesian optimization. (e) The predictions are rendered into explanations, and their predictive power is potentially increased by refining the threshold that drove the prediction at every time-point. (f) Final decision.

2.1 Dataset and study population

The publicly available training set consists of data from two cohorts [18]. Cohort A has 790,215 records of 20,336 patients. Cohort B has 761,995 records of 20,000 patients. The data for every patient record contain 40 clinical covariates, i.e. 8 vital signs, 26 laboratory values, and 6 demographic values. The patient data were labeled according to the Sepsis-3 clinical criteria. Table 1 presents the details of the various clinical covariates used in the study together with their percentage of missing values [18, 19].

Sr. no. | Covariates | Missing values (%) | Units
1 | Heart rate | 9.8 | beats/min
4 | SBP | 14.5 | mm Hg
5 | MAP | 12.45 | mm Hg
6 | DBP | 31.34 | mm Hg
8 | EtCO2 | 96.28 | mm Hg
9 | Excess bicarbonate | 95.57 | mmol/L
13 | PaCO2 | 94.44 | mm Hg
15 | Aspartate transaminase | 98.37 | IU/L
17 | Alkaline phosphatase | 98.39 | IU/L
21 | Direct bilirubin | 99.8 | mg/dL
23 | Lactic acid | 97.32 | mg/dL
27 | Total bilirubin | 98.50 | mg/dL
28 | Troponin I | 99.04 | ng/mL
33 | Fibrinogen concentration | 99.34 | mg/dL
34 | Platelet count | 94.05 | count/mL
36 | Gender | | Male (1) or Female (0)
37 | Unit 1 | 39.42 | true (1) or false (0)
38 | Unit 2 | 39.42 | true (1) or false (0)
41 | SepsisLabel | | septic (1) or nonseptic (0)

Table 1.

Details of the various clinical variables used under study along with missing values information in percentage.

2.2 Feature extraction

Feature extraction takes place on the imputed version of the given clinical data and generates features sample-wise on an hourly time grid. Two types of features were generated, namely:

Physiological features: In the literature, inter-relations among clinical values have been shown to enhance anomaly detection [7, 20]. Based on a review of studies that justify the clinical significance of well-established physiological inter-relations among the given clinical signs, 10 such physiological relations are derived from the given covariates. These are three shock indices: the well-defined Shock Index (SIndex) using systolic BP, and two modified versions proposed in this study using diastolic BP (DBPSIndex) [21] and Mean Arterial Pressure (MAPSIndex) [22]; the ratios BUN/Creatinine (BUNCr) [7], Bilirubin_total/Creatinine (BILTCr), SaO2/FiO2 [23], PaO2/FiO2 [24], and Platelets/Age (Pla_Age); the difference between SBP and DBP, called Pulse Pressure (PP) [25]; and lastly Cardiac Output (CO) [26]. Table 2 gives a detailed description of the physiological features.

Sl. no | Abbreviation | Description | Formula
1 | SIndex | Shock Index: the ratio of heart rate (HR) to systolic blood pressure (SBP), scaled by age. | (HR/SBP)*Age
2 | DBPSIndex | Diastolic Shock Index: the ratio of HR to diastolic blood pressure (DBP), scaled by age. | (HR/DBP)*Age
3 | MAPSIndex | The ratio of HR to Mean Arterial Pressure (MAP), scaled by age. | (HR/MAP)*Age
4 | BUNCr | The ratio of Blood Urea Nitrogen (BUN) to Creatinine. | BUN/Creatinine
5 | BILTCr | The ratio of Total Bilirubin (Bilirubin_total) to Creatinine. | Bilirubin_total/Creatinine
6 | SaO2-FiO2 | The ratio of the oxygen saturation of arterial blood in percentage (SaO2) to the fraction of inspired oxygen (FiO2). | SaO2/FiO2
7 | PaO2-FiO2 | The ratio of the partial pressure of oxygen (PaO2) to the fraction of inspired oxygen (FiO2). | PaO2/FiO2
8 | Pla_Age | The ratio of platelets to age. | Platelets/Age
9 | PP | Pulse Pressure: the difference between SBP and DBP. | SBP-DBP
10 | CO | Cardiac Output: the product of pulse pressure (PP) and HR. | PP * HR

Table 2.

Detailed definitions of the physiological features.
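
The formulas in Table 2 can be sketched for a single hourly record as follows. This is a minimal illustration rather than the authors' code: the dictionary keys (e.g. "HR", "Bilirubin_total") and the rule of returning None when an input is missing or a denominator is zero are assumptions made here.

```python
from typing import Dict, Optional

Number = Optional[float]


def safe_div(num: Number, den: Number) -> Number:
    """Return num / den, or None when an input is missing or den is zero."""
    if num is None or den is None or den == 0:
        return None
    return num / den


def safe_mul(a: Number, b: Number) -> Number:
    """Return a * b, or None when either input is missing."""
    if a is None or b is None:
        return None
    return a * b


def physiological_features(row: Dict[str, Number]) -> Dict[str, Number]:
    """Compute the ten derived features of Table 2 for one hourly record."""
    hr, age = row.get("HR"), row.get("Age")
    sbp, dbp, mean_ap = row.get("SBP"), row.get("DBP"), row.get("MAP")
    pp = None if sbp is None or dbp is None else sbp - dbp
    return {
        "SIndex": safe_mul(safe_div(hr, sbp), age),
        "DBPSIndex": safe_mul(safe_div(hr, dbp), age),
        "MAPSIndex": safe_mul(safe_div(hr, mean_ap), age),
        "BUNCr": safe_div(row.get("BUN"), row.get("Creatinine")),
        "BILTCr": safe_div(row.get("Bilirubin_total"), row.get("Creatinine")),
        "SaO2_FiO2": safe_div(row.get("SaO2"), row.get("FiO2")),
        "PaO2_FiO2": safe_div(row.get("PaO2"), row.get("FiO2")),
        "Pla_Age": safe_div(row.get("Platelets"), age),
        "PP": pp,
        "CO": safe_mul(pp, hr),
    }


# Made-up record used only to exercise the formulas
example = {"HR": 90.0, "SBP": 120.0, "DBP": 80.0, "MAP": 90.0, "Age": 60.0,
           "BUN": 20.0, "Creatinine": 1.0, "Bilirubin_total": 1.5,
           "SaO2": 96.0, "FiO2": 0.4, "PaO2": 80.0, "Platelets": 180.0}
features = physiological_features(example)
```

Because the features are computed on the imputed data, missing inputs should be rare in practice; the None handling only keeps the sketch safe on raw records.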

Time-lag difference features: A set of 35 time-lag features is computed as differences at a 6-hour lag for the vital signs and laboratory values among the given 40 clinical variables, excluding the last 5 demographic values.

Finally, the obtained 45 features are combined with the given 40 clinical signs, thereby increasing the final feature count to 85 features. The resultant feature set is then fed to train the proposed xMLEPS framework.
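
A 6-hour time-lag difference for one covariate of one patient can be sketched as below. The function name and the treatment of the first six hours (no value six hours back, so the difference is left as None) are assumptions; the chapter does not state how the boundary is handled.

```python
from typing import List, Optional


def lag_difference(series: List[Optional[float]], lag: int = 6) -> List[Optional[float]]:
    """Return series[t] - series[t - lag] for each hour t.

    Hours with no value `lag` steps back (t < lag), or with a missing
    operand, yield None. That boundary choice is an assumption.
    """
    out: List[Optional[float]] = []
    for t, current in enumerate(series):
        past = series[t - lag] if t >= lag else None
        out.append(None if current is None or past is None else current - past)
    return out


# Made-up hourly heart-rate values for one ICU stay
heart_rate = [80.0, 82.0, 85.0, 88.0, 90.0, 95.0, 100.0, 104.0]
hr_lag6 = lag_difference(heart_rate)
```

Applying the same function to each of the 35 eligible variables would yield the 35 lag features that are appended to the 40 given variables and the 10 physiological features.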

2.3 Implementation of xMLEPS

By combining Bayesian optimization with refinement of the prediction risk threshold, an optimal method for detecting sepsis onset six hours in advance, called xMLEPS, is developed. As shown in Figure 1, the given clinical sepsis data have a large amount of missing information (approximately 20%). These missing values are therefore filled in as a pre-processing step at the start of the algorithm. The missing values are imputed by forward filling the given EHR clinical data. In a real-time scenario, a missing value can only be filled with previously available measurements; thus, only the previous clinical values of the given EHR data are used to impute the current observation.

In this study, imputation is carried out in two rounds: first local imputation for each individual record, and then global imputation over all records combined. In local imputation, the missing values that trail an observation in a row for a particular clinical covariate (or feature vector) are forward filled with the nearest past non-missing value in that row, locally for the given record. If the record has ‘NaN’ values at the beginning, i.e. from the very first measurement at t = 0, they are initially retained as they are and later replaced with the ‘global mean’ of that covariate row obtained by combining all records [19].
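
The two rounds can be sketched for a single covariate as follows. This is an illustrative reading, not the authors' implementation: the data layout (one list of hourly values per patient ID) is hypothetical, and computing the global mean over the originally observed values is an assumption.

```python
from typing import Dict, List, Optional

Series = List[Optional[float]]


def forward_fill(series: Series) -> Series:
    """Local round: carry the last observed value forward within one record."""
    filled: Series = []
    last: Optional[float] = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def impute(records: Dict[str, Series]) -> Dict[str, List[float]]:
    """Two-round imputation of one covariate across all patient records."""
    # Global mean over every observed value of this covariate (assumed definition)
    observed = [v for s in records.values() for v in s if v is not None]
    if not observed:
        raise ValueError("covariate has no observed values to average")
    global_mean = sum(observed) / len(observed)
    result: Dict[str, List[float]] = {}
    for patient_id, series in records.items():
        local = forward_fill(series)
        # Global round: leading gaps that forward fill could not reach
        result[patient_id] = [global_mean if v is None else v for v in local]
    return result


# Made-up lactate series for two patients; None marks a missing hour
lactate = {
    "p1": [None, None, 2.0, None, 4.0],
    "p2": [1.0, None, None, 5.0, None],
}
imputed = impute(lactate)
```

Forward filling never looks ahead in time, which matches the real-time constraint described above; only the global mean uses information from other records.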

During model development, a ten-fold cross-validation scheme is employed wherein 10 LightGBM classifiers, one per fold, are developed with the same hyper-parameter settings obtained during Bayesian optimization. The total feature set used to develop these models comprises 85 features, as described in Section 2.2. Generally, hyper-parameter optimization looks for the hyper-parameter values that minimize the objective loss function. Here, the hyper-parameter settings that maximize the custom-defined challenge metric, the utility score, on the subset of training data during the Bayesian optimization phase are later used to build the models. The built models generate predictions on the held-out 10% of validation data in each fold. The training of the model in each fold stops when the utility score on the validation set shows no further improvement over 32 consecutive iterations; this early stopping resets the model to its best iteration and thereby avoids over-fitting.
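
The early-stopping rule can be illustrated independently of any library. The sketch below takes a list of validation scores, one per boosting iteration, and returns the iteration to which the model would be reset; the function name and the example scores are hypothetical.

```python
from typing import Sequence, Tuple


def early_stop(scores: Sequence[float], patience: int = 32) -> Tuple[int, float]:
    """Return (best_iteration, best_score) under patience-based early stopping.

    scores[i] is the validation score after boosting iteration i (higher is
    better). Scanning stops once `patience` consecutive iterations fail to
    beat the best score seen so far.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    best_iter, best_score = 0, scores[0]
    for i in range(1, len(scores)):
        if scores[i] > best_score:
            best_iter, best_score = i, scores[i]
        elif i - best_iter >= patience:
            break  # no improvement for `patience` iterations in a row
    return best_iter, best_score


# Made-up validation utility per iteration; a short patience keeps the demo small
validation_utility = [0.30, 0.35, 0.40, 0.39, 0.38, 0.37, 0.45]
best_iter, best_score = early_stop(validation_utility, patience=3)
```

With a patience of 3 the late improvement at the last iteration is never reached, which shows the trade-off that a patience of 32 is meant to soften.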

The initial predictions generated by each optimal model on the validation data of its fold undergo refinement of the prediction risk threshold to enhance the utility score. The search space for the prediction risk threshold lies in the range of 0 to 1 and is varied in steps of 0.05. Thus the threshold search space has 20 values. The initial predictions on the validation data of each fold are compared with each of these 20 values, and the threshold value that gives the maximum utility score for the predictions of that fold is taken as optimal. The resulting 10 optimum threshold values are later used by the corresponding models to refine the predictive power, in terms of utility score, of the labels generated in each fold.
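
The threshold search amounts to a small grid search over the predicted risks of one fold. In the sketch below, the F1 score stands in for the challenge utility score to keep the example short, and the grid 0.00, 0.05, ..., 0.95 is an assumption about which endpoint is dropped to obtain 20 values; all names and data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Stand-in metric; the chapter optimizes the challenge utility score."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def threshold_grid(step: float = 0.05) -> List[float]:
    """20 candidates 0.00, 0.05, ..., 0.95 (the excluded endpoint is assumed)."""
    return [round(step * k, 2) for k in range(round(1 / step))]


def best_threshold(
    labels: Sequence[int],
    probs: Sequence[float],
    metric: Callable[[Sequence[int], Sequence[int]], float] = f1_score,
    step: float = 0.05,
) -> Tuple[float, float]:
    """Return the (threshold, score) pair that maximizes `metric` on one fold."""
    best_t, best_s = 0.0, float("-inf")
    for t in threshold_grid(step):
        preds = [1 if p >= t else 0 for p in probs]
        score = metric(labels, preds)
        if score > best_s:  # ties keep the lowest threshold
            best_t, best_s = t, score
    return best_t, best_s


# Made-up validation labels and predicted risks for one fold
fold_labels = [0, 0, 0, 0, 1, 0, 1, 1]
fold_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]
fold_threshold, fold_score = best_threshold(fold_labels, fold_probs)
```

Running the search once per fold gives one threshold per model, mirroring the 10 optimum thresholds described above.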

This LightGBM-based gradient boosting framework provides a specific processing method for sparse data, which is important in our classification task with its class imbalance problem [27]. For interpretability, the feature importance attribute of LightGBM is used to quantify the contribution of each variable, and explainability is addressed with SHAP summary and dependency plots, which illustrate the distribution of variable importance [28, 29].

3. Results

The proposed framework uses the given patient records to predict the risk of sepsis onset in the next 6 hours. Each prediction is assessed using a continuous-valued utility score defined by the challenge organizers [18]. The utility function rewards or penalizes classifiers for their predictions within 12 hours before and 3 hours after the sepsis onset time and is normalized as described in [18]. Using a ten-fold cross-validation scheme, 10 LightGBM models are designed on patient-wise stratified folds, each containing a unique 10% of the entire training set. The hyper-parameters of these models that minimize the cross-validation loss are obtained using the automatic hyper-parameter optimization utility ‘bayesopt’ in Python [30, 31]. The underlying objective function formulated for the optimization is intended to maximize the AUROC. The utility finds the optimal parameters automatically using Bayesian optimization. The optimized models have the following settings: ‘num_leaves’ of 60, ‘min_data_in_leaf’ of 120, ‘max_depth’ of 2, ‘learning_rate’ of 0.01, ‘scale_pos_weight’ of 20, and ‘min_samples_split’ of 4.
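
A patient-wise stratified split keeps all hourly records of a patient in the same fold while spreading septic and non-septic patients evenly. The sketch below assumes stratification on a per-patient flag (whether the patient ever becomes septic); the function name, the round-robin dealing, and the synthetic patient IDs are illustrative choices, not details given in the chapter.

```python
import random
from typing import Dict, List


def patient_stratified_folds(
    patient_is_septic: Dict[str, bool], n_folds: int = 10, seed: int = 0
) -> List[List[str]]:
    """Split patient IDs into folds, balancing septic and non-septic patients."""
    rng = random.Random(seed)
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    offset = 0
    for septic in (True, False):
        group = sorted(p for p, s in patient_is_septic.items() if s == septic)
        rng.shuffle(group)
        # Deal patients round-robin, continuing where the previous class stopped
        for i, patient in enumerate(group):
            folds[(offset + i) % n_folds].append(patient)
        offset += len(group)
    return folds


# Synthetic cohort: 100 patients, of whom the first 10 are septic
patients = {f"p{i:03d}": i < 10 for i in range(100)}
folds = patient_stratified_folds(patients)
```

Each fold then serves once as the held-out validation set, with the hourly records of its patients looked up by ID.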

Table 3 summarizes the results of the proposed framework on the entire training data in a ten-fold cross-validation scheme. The results also include the performance of the inter-cohort and baseline studies. To ensure that the models trained in the proposed study learn dependencies not only between patient records but also across cohorts, we considered an inter-cohort training and testing scheme, i.e. a model trained on the data of cohort A was scored on cohort B data and vice versa. This reduces the concern of over-fitting and supports the robustness of the framework. The inter-cohort scores for A and B were 0.3191 and 0.3284, respectively.

Method | AUROC | | Utility score
Average (Std) | 0.8591 (0.0085) | 0.1502 (0.0286) | 0.4214 (0.0148)
Baseline 1 | 0.8560 | 0.1517 | 0.3870
Baseline 2 | 0.8502 | 0.1376 | 0.3509
Baseline 3 | 0.8124 | 0.1197 | 0.3198
xMLEPS, Set A (Training) and Set B (Test) | | | 0.3191
xMLEPS, Set B (Training) and Set A (Test) | | | 0.3284

Table 3.

Results summary of the proposed framework.

3.1 Comparison of xMLEPS with baseline

Further, to emphasize the clinical relevance of the derived features under this proposed method, a comparative analysis of results is done by carrying out three baseline studies as shown in Figure 2.

Figure 2.

Comparison of results by xMLEPS with the three base-line studies. US: Utility score, F1: F1 score.

As part of the comparative analysis, three well-tuned baseline studies are performed. In the first, the proposed method with the feature set of 85 features is tested without optimal threshold refinement (the default threshold value of 0.5 is used) in a 10-fold cross-validation scheme. In the second and third, only the given 40 clinical variables are fed directly to LightGBM models, with and without optimal threshold refinement respectively, in a 10-fold cross-validation scheme. Table 3 presents the results of these three baseline studies. As expected, the proposed method xMLEPS outperforms all three. The third study, carried out without derived features and without optimal threshold refinement, shows the worst performance. Even for the first baseline study, the utility score is about 3 percentage points lower than that of the proposed method (0.3870 versus 0.4214).

3.2 Explanation and visualization of feature importance

The cumulative feature importance of the top 50 features is shown in Figure 3. Here the LightGBM feature importance attribute is used for the gradient boosting framework developed. The approach used is to count the number of times a feature is used to split the data across all trees. The weakness of this approach is that it does not account for the different impacts of different splits. The next best approach is to attribute to each feature the reduction in average training loss gained when it is used for splitting. This “Gain” measure of feature importance recovers the correct mutual information between feature inputs and label outputs [32]. Its limitation is that it is easily biased when trees are built greedily in finite ensembles. Other methods have therefore been designed to compensate for the bias of the gain approach in feature selection [33, 34].
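
The difference between the two importance measures can be seen on a toy list of tree splits. The feature names are borrowed from the chapter, but the split list and its loss reductions are made up for illustration; in LightGBM the two measures correspond to the "split" and "gain" importance types.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each entry is (feature used for the split, loss reduction of that split)
splits: List[Tuple[str, float]] = [
    ("ICULOS", 40.0), ("HR", 1.0), ("HR", 1.5), ("HR", 0.5),
    ("PaO2_FiO2", 12.0), ("ICULOS", 25.0),
]


def importance(all_splits: List[Tuple[str, float]]) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (split counts, total gain) per feature over all trees."""
    counts: Dict[str, int] = defaultdict(int)
    gains: Dict[str, float] = defaultdict(float)
    for feature, gain in all_splits:
        counts[feature] += 1
        gains[feature] += gain
    return dict(counts), dict(gains)


def ranking(scores: Dict[str, float]) -> List[str]:
    """Features sorted from most to least important (ties broken by name)."""
    return sorted(scores, key=lambda f: (-scores[f], f))


split_count, total_gain = importance(splits)
rank_by_count = ranking(split_count)
rank_by_gain = ranking(total_gain)
```

Here HR is split on most often but each split contributes little, so it ranks first by count and last by gain, which is the discrepancy the paragraph above describes.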

Figure 3.

Cumulative feature importance of the top 50 features using the feature importance attribute of LightGBM.

The SHAP summary plot of the 20 most important clinical features for sepsis onset, as identified by the xMLEPS framework, is shown in Figure 4(a). Here feature importance is obtained by sorting the relevance scores in decreasing order of mean relevance across the population, computed from the local explanations but considering only those individuals who were positive for sepsis. The mean relevance is displayed as blue horizontal bars in Figure 4(a). The summary of local explanations is shown in Figure 4(b), wherein all the individual data points are displaced by mean relevance for sepsis and are colored by feature values. From Figure 4(b) we can infer that an increase in the length of stay (ICULOS) and higher values of clinical ratios such as PaO2/FiO2 and the shock indices DBPSIndex and SIndex are associated with the development of sepsis, as are lower Platelets, DBP and Magnesium levels. These findings are consistent with previous studies [7, 21, 35, 36].

Figure 4.

Results from the SHAP explanation module showing the global feature importance together with the local explanation summary. PaO2: partial pressure of oxygen. FiO2: fraction of inspired oxygen. HR: Heart Rate. DBP: Diastolic Blood Pressure. SBP: Systolic Blood Pressure. SIndex: Shock Index. DBPSIndex: Diastolic Blood Pressure Shock Index. PaCO2: Partial pressure of carbon dioxide. PTT: Partial thromboplastin time. WBC: Leukocyte count. BUN: Blood Urea Nitrogen.
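
The relevance scores behind such plots are Shapley values. The sketch below computes them exactly for a toy three-feature risk function by enumerating feature subsets, replacing absent features with a baseline value. This is a simplified stand-in, not the tree-specific SHAP algorithm and not the xMLEPS model; the risk function, its coefficients, and the patient and baseline values are all hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, Set


def shapley_values(
    model: Callable[[Dict[str, float]], float],
    x: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values of `model` at `x` relative to `baseline`."""
    names = list(x)
    n = len(names)

    def value(subset: Set[str]) -> float:
        # Features in `subset` take the patient's values, the rest the baseline
        mixed = {f: (x[f] if f in subset else baseline[f]) for f in names}
        return model(mixed)

    phi: Dict[str, float] = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(r: Dict[str, float]) -> float:
    """Hypothetical risk score with an ICULOS x HR interaction."""
    return (0.01 * r["ICULOS"] + 0.002 * r["HR"] - 0.003 * r["DBP"]
            + 0.0001 * r["ICULOS"] * r["HR"])


patient = {"ICULOS": 50.0, "HR": 120.0, "DBP": 50.0}
reference = {"ICULOS": 10.0, "HR": 80.0, "DBP": 70.0}
phi = shapley_values(toy_risk, patient, reference)
```

The values sum to the difference between the patient's score and the baseline score, which is the additivity property that lets each bar in a summary plot be read as a share of the prediction.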

Further, the impact of each feature and the interactions among features in sepsis development can be illustrated using SHAP dependency plots. As an example, Figure 4(c) depicts the dependency plot showing the interaction of heart rate with ICULOS. As seen, the xMLEPS model associates high heart rate values in the range 120–180 with increased ICULOS and hence with sepsis. Figure 4(d) further shows that lower values of SBP (approximately below 90) are associated with increased ICULOS and sepsis. A summary plot of the SHAP interaction value matrix is shown in Figure 5, wherein the diagonal reflects the main effects, while the off-diagonal entries show the interaction effects. The explainable model produces a high probability when it is confident about a decision, resulting in larger relevance scores because more relevance is available to distribute backward. Conversely, the model outputs a lower probability when it is less confident that the patient will develop sepsis and, as a result, yields lower relevance scores. This summary of the score distribution gives clinicians hints about what to expect from the designed model in clinical practice.

Figure 5.

Summary plot of a SHAP interaction value matrix.

4. Discussion

This study justifies the clinical significance of the derived physiological inter-relations among the clinical signs via feature importance and SHAP plots for visualized interpretation. Though SHAP values cannot be used as a generalized approach for early prediction of sepsis, they certainly help in generating relevant clinical hypotheses for the events of interest. The SHAP illustrations help mitigate the black-box concerns associated with prediction models and may give clinicians a better understanding of the important features of the xMLEPS framework. The proposed framework is able to establish the significance of the individual features that contribute to improving the utility score, thus ensuring the interpretability of the framework for its clinical users. Furthermore, because the proposed prediction framework relies on clinical ICU data from routine care rather than on advanced molecular biomarkers, it can potentially be integrated into a computerized clinical decision support system.

The recent research literature relevant to the early diagnosis of sepsis comes from the articles describing the various entries submitted to the PhysioNet Challenge 2019 [18]. This challenge aimed at the design and development of algorithms for early and automated prediction of sepsis onset, with the optimal prediction window defined as six hours before the actual clinical recognition of disease onset. The predictions of the machine learning algorithms were rewarded if they detected true positives up to 12 hours before disease onset and were slightly penalized if they were false positives. However, predictions were strongly penalized if they were incorrect near disease onset. The reason for choosing six hours as the optimal prediction window is the clinical finding that the observed median time to antimicrobial therapy is 6 hours [37]. Furthermore, each hour of delay in treatment results in an average decrease in survival rate of 7.6% [37].
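
A utility function with the shape described above can be sketched as follows. The constants (a reward of up to 1, a false-positive penalty of 0.05, a miss penalty reaching 2, and the 12/6/3-hour breakpoints) and the piecewise-linear form are assumptions modeled on the challenge description in [18]; they are not specified in this chapter, and the official scoring code should be consulted for actual evaluation.

```python
from typing import List, Sequence


def first_positive(labels: Sequence[int]) -> int:
    """Index of the first positive label (labels must contain one)."""
    return list(labels).index(1)


def patient_utility(labels: Sequence[int], preds: Sequence[int],
                    dt_early: int = -12, dt_optimal: int = -6, dt_late: int = 3,
                    max_u_tp: float = 1.0, min_u_fn: float = -2.0,
                    u_fp: float = -0.05, u_tn: float = 0.0) -> float:
    """Sum of hourly utilities for one patient (all constants are assumed)."""
    septic = any(labels)
    # Labels are assumed to switch on |dt_optimal| hours before clinical onset
    t_sepsis = first_positive(labels) - dt_optimal if septic else float("inf")
    rise = max_u_tp / (dt_optimal - dt_early)   # reward slope up to the optimum
    fall = -max_u_tp / (dt_late - dt_optimal)   # reward slope after the optimum
    miss = min_u_fn / (dt_late - dt_optimal)    # penalty slope for late misses
    total = 0.0
    for t, pred in enumerate(preds):
        dt = t - t_sepsis
        if dt > dt_late:
            continue  # hours well after onset are not scored
        if not septic:
            total += u_fp if pred else u_tn
        elif pred:
            if dt <= dt_optimal:
                total += max(rise * (dt - dt_early), u_fp)
            else:
                total += fall * (dt - dt_late)
        elif dt > dt_optimal:
            total += miss * (dt - dt_optimal)
    return total


def best_predictions(labels: Sequence[int], dt_early: int = -12,
                     dt_optimal: int = -6, dt_late: int = 3) -> List[int]:
    """Alarm exactly inside the rewarded window of a septic patient."""
    preds = [0] * len(labels)
    if any(labels):
        t_sepsis = first_positive(labels) - dt_optimal
        start = max(0, t_sepsis + dt_early)
        stop = min(t_sepsis + dt_late + 1, len(labels))
        preds[start:stop] = [1] * (stop - start)
    return preds


def normalized_utility(all_labels: Sequence[Sequence[int]],
                       all_preds: Sequence[Sequence[int]]) -> float:
    """Scale so that never alarming scores 0 and the ideal alarm pattern scores 1."""
    observed = best = inaction = 0.0
    for labels, preds in zip(all_labels, all_preds):
        observed += patient_utility(labels, preds)
        inaction += patient_utility(labels, [0] * len(labels))
        best += patient_utility(labels, best_predictions(labels))
    return (observed - inaction) / (best - inaction)


# Made-up cohort: one septic patient (label turns on at hour 8) and one non-septic
septic_labels = [0] * 8 + [1] * 6
healthy_labels = [0] * 10
cohort_labels = [septic_labels, healthy_labels]
```

Under this normalization a model that never raises an alarm scores 0 and one that alarms exactly in the rewarded window scores 1, so the cross-validated score of 0.4214 sits between those two reference points.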

The comparative analysis of the results obtained by the proposed method with our previous works [19, 38] and with the submission approaches [39, 40, 41, 42, 43, 44, 45, 46] that reported the best results in the PhysioNet 2019 Challenge [18] is listed in Table 4. Most of these approaches used a 5- or 10-fold cross-validation scheme and yielded utility scores in the range of 0.36–0.45.

Reference | Methodology | AUROC | Utility Score
Chang et al. [42] | Temporal Convolutional Networks (TCN) | | 0.4170
Li et al. [45] | A Time-phased model | | 0.4300
Morrill et al. [39] | A signature transform-based model | | 0.4340
Zabihi et al. [40] | XGBoost Ensemble models | 0.8333 | 0.4280
Yang et al. [41] | Fusion-based XGBoost | 0.8400 | 0.4300
Du et al. [46] | Gradient Boosting Scheme | 0.8630 | 0.4090
Lee et al. [43] | Graph Convolutional Networks (GCN) | 0.8170 | 0.3820
Lyra et al. [44] | Using Random forest classifier | 0.8100 | 0.3760
Nesaragi and Patidar [38] | Ratio and Power-based features | 0.8432 | 0.4013
Nesaragi et al. [19] | PMI-based Tensor factorization | 0.8621 | 0.4519

Table 4.

Summary of the results obtained by our previous works and the submitted solutions to PhysioNet 2019 challenge under 5/10-fold cross-validation scheme using training data.

This study supports the use of the utility score as an effective metric for sepsis onset on ICU data. However, experiments showed that the F1 score also gave reliable results in line with the utility score, i.e. the F1 score rises and falls together with the utility score. Note that the utility score is bounded between −2 and 1, whereas the F1 score is bounded between 0 and 1. The other conventional metrics, namely AUROC, AUPRC, and accuracy, are not informative on such a highly unbalanced dataset and can be misleading for sepsis onset. Further, the fact that interpreting these results together with the utility score is quite difficult cannot be ignored, as mentioned by Roussel et al. [47].
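
Why a conventional metric can mislead on unbalanced data is easy to show for accuracy. The sketch below uses a made-up prevalence of 2% and a classifier that never raises an alarm; it illustrates the point for accuracy only and says nothing about AUROC or AUPRC.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of hours classified correctly."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def f1(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Harmonic mean of precision and recall; 0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


# Made-up sample: 2 septic hours out of 100
labels = [1] * 2 + [0] * 98
never_alarm = [0] * 100

acc_never = accuracy(labels, never_alarm)
f1_never = f1(labels, never_alarm)
```

The silent classifier reaches an accuracy of 0.98 while its F1 score is 0, which is why a metric sensitive to missed septic cases is preferred here.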

The limitation of this study is that it is restricted to a two-center cohort design based on the available training data, which may raise the doubt that the trained models over-fit to the particular cohort data and its patient records. However, the analyzed ICU admissions originate from a diverse population covering the entire spectrum of ICU patients, and the validation through the inter-cohort train-test approach, together with optimum threshold refinement, supports the deployability of our framework in other ICUs.

5. Conclusion

This study presents xMLEPS, an explainable machine learning framework for the early prediction of sepsis using clinical data in the ICU setting. Its predictive explanations justify the clinical significance of physiological inter-relations among the given clinical signs via visualized interpretation, and thus assist clinicians in diagnostic decision-making and in recommending future actions that improve the quality of predictions. This suggests that data-driven automated ML models have the potential to drive a paradigm shift from conventional detection and treatment to automated early prediction that helps prevent organ system failure due to sepsis.


Conflict of interest

The authors declare no conflict of interest.


© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Naimahmed Nesaragi and Shivnarayan Patidar (July 23rd 2021). An Explainable Machine Learning Model for Early Prediction of Sepsis Using ICU Data [Online First], IntechOpen, DOI: 10.5772/intechopen.98957. Available from:
