Patient characteristics of the baseline dataset.
We propose a method for quantitative analysis of predictive power of laboratory tests and early detection of mortality risk by usage of predictive models and feature selection techniques. Our method allows automatic feature selection, model selection, and evaluation of predictive models. Experimental evaluation was conducted on patients with renal failure admitted to ICUs (medical intensive care, surgical intensive care, cardiac, and cardiac surgery recovery units) at Boston’s Beth Israel Deaconess Medical Center. Data are extracted from Multi parameter Intelligent Monitoring in Intensive Care III (MIMIC-III) database. We built and evaluated different single (e.g. Logistic regression) and ensemble (e.g. Random Forest) learning methods. Results revealed high predictive accuracy (area under the precision-recall curve (AUPRC) values >86%) from day four, with acceptable results on the second (>81%) and third day (>85%). Random forests seem to provide the best predictive accuracy. Feature selection techniques Gini and ReliefF scored best in most cases. Lactate, white blood cells, sodium, anion gap, chloride, bicarbonate, creatinine, urea nitrogen, potassium, glucose, INR, hemoglobin, phosphate, total bilirubin, and base excess were most predictive for hospital mortality. Ensemble learning methods are able to predict hospital mortality with high accuracy, based on laboratory tests and provide ranking in predictive priority.
- mortality risk prediction
- renal failure
- metabolic panel
- feature selection
- ensemble methods
Precision medicine is based on comprehensive models with the potential to elucidate the complexity of health and diseases, including the features of emergence, nonlinearity, self-organization, and adaptation .
Laboratory testing is more common among patients admitted to ICU [2, 3]. Blood sample frequencies vary, but routinely tests are ordered by fixed schedule and in clusters as part of the hypothetico-deductive diagnostic exploration. Quantitative predictive analysis of daily sampling might provide new insights into the choice (feature selection) and importance (feature weighting) of each laboratory test . In this chapter, we propose a system for mortality risk prediction of patients with renal failure, based on predictive methods. Renal failure patients were selected based on the Elixhauser Comorbidity Index . For chronic disease, the use of Elixhauser is sensitive for the systemic underrepresentation of chronic conditions [6, 7].
This study quantitatively assessed the predictive power of laboratory tests for hospital mortality in patients admitted to ICU. Based on previous findings, we compared the predictive performance of different single (Decision Tree, Naive Bayes, Logistic and Regression) and ensemble (Random Forest, Boosting, and Bagging) learning methods. Moreover, the predictive power and importance of predictors (laboratory tests) were quantitatively assessed by use of feature weighting and selection techniques: Correlation, Gini Selection, Information Gain and ReliefF . For predictive modeling, feature selection, and visual analytics of the results, we used RapidMiner and R platforms as mentioned in [9, 10, 11].
2. Materials and methods
2.1. Data source and study subjects
The MIMIC-III (version 1.0) clinical database consists of 58,976 ICU admissions for 46,520 distinct patients, admitted to Beth Israel Deaconess Medical Center (Boston, MA) from 2001 to 2012 [12, 13]. The establishment of the database was approved by the Institutional Review Boards of the Massachusetts Institute of Technology (Cambridge, MA) and Beth Israel Deaconess Medical Center (Boston, MA). Accessing the database was approved for authors S.V.P and Z.Z. (certification number: 1712927 and 1132877). Informed consent was waived due to observational nature of the study.
The MIMIC-III clinical database includes data related to patient demographics, hospital admissions and discharge dates, room tracking, death dates (in or out of the hospital), ICD-9 codes, health care providers, and types. All dates were surrogate dates but time intervals were preserved. In addition, physiological data, medications consumption, laboratory investigations, fluid balance calculations and notes, and reports were included in the basic dataset.
2.2. Data preparation
The dataset in this study was generated by joining data from the following MIMIC-III tables: admission, patients, ICU stays, diagnoses_icd, and lab events. Patients were assigned to subpopulations including hypertension, paralysis, chronic pulmonary disease, diabetes, renal failure, acquired immunodeficiency syndrome (AIDS), coagulopathy, obesity, and weight loss, and so on, based on the Elixhauser comorbidity score [6, 7]. Renal failure is defined in the Elixhauser comorbidity score, when ICD-9 code is in (70.32, 70.33, 70.54, 456, 456.1, 456.2, 456.21, 571, 571.2, 571.3, ≥ 571.4 ≤ 571.49, 571.5, 571.6, 571.8, 571.9, 572.3, 572.8, and 42.7).
All time stamped measurements in MIMIC-III were zeroed in reference to the moment of hospital admission.
2.3. Predictive algorithms
The process compared different learning and ensemble methods (Decision Stump, Decision Tree, Naive Bayes, Logistic Regression (LR), Random Forest, Support Vector Machine, AdaBoost, Bagging, and Stacking) in association with feature weighting and selection, quantitatively assessed in terms of Correlation, Gini Selection, and Information Gain and ReliefF as previously described .
2.3.1. Single learning methods
Decision trees (DT) are predictive algorithms based on “greedy,” top-down recursively partitioning of data. DT algorithms perform an exhaustive search over all possible splits in every recursive step. The attribute (predictor) demonstrating the best split by an evaluation measure selected for branching the tree. Regularly used are information theoretic measures (e.g. Information Gain, Gain Ratio, Gini, etc.) or statistical tests quantifying the significance of the association between predictors and class. The procedure is recursively iterated until a stop criterion is met [15, 16]. In this research, we used the J48 algorithm, which is the Java implementation of the C4.5 algorithm .
Logistic regression (LR) is a linear classifier modeling the probability of a dependent binary variable y given a vector of independent variables X. For the estimation of the probability, the example belongs to the positive class, a logit model is used:
where p presents probability that y = 1, θj, j = 1,…,n present the weights of the corresponding dependent variable, while p/(1-p) is called odds ratio, parameters θj, j = 1,…,n of the model can be interpreted as changes in log odds or the results can be interpreted in terms of probabilities [18, 19, 20].
2.3.2. Ensemble learning methods
Ensemble (meta-learning) methods combine multiple models aiming to provide more accurate or more stable predictions. These models can be aggregated from the same model built on different sub-samples of data, from different models built on the same sample or a combination of the previous two techniques. Ensemble methods are often used to improve the individual performance of algorithms that constitute ensembles by exploiting the diversity among the models produced . The ensemble methods implemented in this chapter are: Random Forest , Boosting , and Bootstrap Aggregating (Bagging) . In our experiments, Boosting and Bagging used J4.8 and Logistic regression as base learners.
Random Forest (RF) is an ensemble classifier that evaluates multiple DT and aggregates their results, by majority voting, in order to classify an example . There is a two-level randomization in building these models. First, each tree is trained on a bootstrap sample of the training data and second, in each recursive iteration of building a DT (splitting data based on information potential of features); a subset of features for evaluation is randomly selected. In this research, we grew and evaluated Random Forest (RF) with 10 trees.
Boosting is an ensemble meta-algorithm developed in order to improve supervised learning performance of weak learners (models whose predictive performance is only slightly better than random guessing). In this study, the adaptive boosting (AdaBoost) algorithm was used .
Bagging algorithm builds a series of models (e.g. CHAID Decision Trees) on different data subsamples (with replacement) . For new examples, each model is applied, and predictions are aggregated (e.g. majority voting for classification or average prediction for regression).
2.4. Feature weighting and selection
Several filter feature selection schemes were evaluated. Filter selection (FS) methods rely on the evaluation of the information potential of each input feature in relation to the label (hospital mortality). A threshold search and selection of those features, providing most predictive power, was calculated for each predictive model. The first is based on Pearson correlation returning the absolute or squared value of the correlation as attribute weight. Furthermore, we applied Information Gain Ratio and Gini Index, two weighting schemes that are based on information theoretic measures, frequently used with decision trees for evaluation of potential splits . The T-test calculated, for each attribute, a p-value for two-sided, two-sample T-test. Finally, the ReliefF evaluated the impact of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different class .
3. Experimental evaluation
3.1. Exploratory analyses
The MIMIC-III database consists of 58,976 hospital admissions from 46,520 patients. All patients are characterized by at least one ICU admission.
Guided by the CRoss-Industry Standard Process for Data Mining (CRISP-DM), the ETL process was initiated by retrieving data from the MIMIC-III tables of interest (d_labitems, admissions, patients, and diagnosis_icd) . Next, patients were selected for renal failure by the Elixhauser score, leading to 1477 (3.15%) patients satisfying our inclusion criteria and 20,068 patient days (examples) in total. In a consecutive step, admissions were joined based on hospital admission id (hadm_id) with all laboratory tests (from the 755 item ids in d_labitems), aggregated on a daily level. Mean, standard deviation, and the number of tests per day (len) were defined as aggregation functions. As an output feature (label), this study focused on hospital mortality (hospital_expire_flag). From all renal failure patients in this study, 399 (27.0%) did not survive during hospital admission. Next, data were split per day in order to examine feature selection and weight changes over time. Therefore, we arbitrarily limited our computations for admission duration of 7 days, where for each day the number of patients was >1000. After that period, the number of patients admitted to ICU declined.
Patients who survived hospital stay were significantly older (69.3 ± 12.4 years vs. 65.9 ± 14.1 years; p < 0.05), suffered more frequently from deficiency anemia (15.5 vs. 9.8% p = 0.01) and depression (8.3 vs. 3.8% p = 0.00). The survivors suffered less frequently from congestive heart failure (40.2 vs. 46.9% p = 0.02), valvular disease (9.8 vs. 14.3% p = 0.01), lymphoma (1.6 vs. 3.8% p = 0.01), and metastatic cancer (1.7 vs. 4.5% p = 0.00). Table 1 displays the basic characteristics of the baseline dataset. Binary variables are reported as prevalence percentages or count, and continuous variables are reported as data mean ± standard deviation.
|Characteristics||ICU patients with renal failure (n = 1477)||Survival during hospital admission (n = 1078) (73.0%)||Death during hospital admission (n = 399) (27.0%)||p|
|Age (years)||66.8 ± 13.8||69.3 ± 12.4||65.9 ± 14.1||<0.05|
|Sex (male, %)||60.1||59.4||61.9|
|Congestive heart failure||620 (42.0)||433 (40.2)||187 (46.9)||0.02|
|Cardiac arrhythmias||438 (29.7)||307 (28.5)||131 (32.8)||0.10|
|Valvular disease||163 (11.0)||106 (9.8)||57 (14.3)||0.01|
|Pulmonary circulation||95 (6.4)||68 (6.3)||27 (6.8)||0.75|
|Peripheral vascular disease||274 (18.6)||199 (18.5)||75 (18.8)||0.88|
|Hypertension||13 (0.9)||8 (0.7)||5 (1.3)||0.35|
|Paralysis||19 (1.3)||13 (1.2)||6 (1.5)||0.65|
|Other neurological||70 (4.7)||46 (4.3)||24 (6.0)||0.16|
|Chronic pulmonary disease||253 (17.1)||188 (17.4)||65 (16.3)||0.60|
|Diabetes uncomplicated||322 (21.8)||225 (20.9)||97 (24.3)||0.15|
|Diabetes complicated||435 (28.5)||326 (30.2)||109 (27.3)||0.27|
|Hypothyroidism||155 (10.5)||120 (11.1)||35 (8.8)||0.19|
|Renal failure||1477 (100)||1078 (100.0)||399 (100.0)|
|Liver disease||76 (5.1)||51 (4.7)||25 (6.3)||0.24|
|Peptic ulcer||12 (0.8)||10 (0.9)||2 (0.5)||0.42|
|Aids||19 (1.3)||14 (1.3)||5 (1.3)||0.94|
|Lymphoma||32 (2.2)||17 (1.6)||15 (3.8)||0.01|
|Metastatic cancer||36 (2.4)||18 (1.7)||18 (4.5)||0.00|
|Solid tumor||68 (4.6)||46 (4.3)||22 (5.5)||0.31|
|Rheumatoid arthritis||44 (3.0)||31 (2.9)||13 (3.3)||0.70|
|Coagulopathy||149 (10.1)||106 (9.8)||43 (10.8)||0.59|
|Obesity||34 (2.3)||28 (2.6)||6 (1.5)||0.21|
|Weight loss||60 (4.1)||37 (3.4)||23 (5.8)||0.04|
|Fluid electrolyte||572 (38.7)||414 (38.4)||158 (39.6)||0.68|
|Blood loss anemia||0 (0.0)||0||0||0|
|Deficiency anemias||206 (13.9)||167 (15.5)||39 (9.8)||0.01|
|Alcohol abuse||41 (2.8)||29 (2.7)||12 (3.0)||0.74|
|Drug abuse||25 (1.7)||19 (1.8)||39 (1.5)||0.73|
|Psychoses||42 (2.8)||32 (3.0)||10 (2.5)||0.63|
|Depression||105 (7.1)||90 (8.3)||15 (3.8)||0.00|
In Figure 1, distributions of numbers of laboratory tests by admission days are described by the box plots for each day demonstrating a decline in the number of different laboratory tests requested by admission days from day 1 to day 4. For the following days, the number of requested laboratory tests was stable.
3.2. Automatic model building, feature selection and evaluation
A more detailed technical description of the use of RapidMiner for scalable predictive analytics of medical data, as well as templates of generic processes, can be found in  and its supplementary materials.
Initially, all features are weighted by five feature weighting and selection methods (Information Gain ratio, Gini, Correlation, ReliefF, and T-test), for each day. In order to find the adequate number of features that will be used by each predictive model for each day (and to identify optimal feature selection methods for our data), we conducted the following procedure. First, we sorted the features by their weights in descending order (for each feature weighting method). Then we trained each of five predictive models (Decision tree, Logistic regression, Random Forest, Bagging, and Boosting) on subsets of features with highest weights, starting from 10 features up to 100 with the step of 10 (9 different feature sets) [27, 28]. Even though a number of experiments were conducted (315 experiments: 7 algorithms X 5 feature selection schemes X 9 thresholds), this method as previously described  allowed ease of implementation of the experimental setup within only one RapidMiner process execution and with complete reproducibility of the results.
Evaluation of all predictive models was performed by AUPRC (area under the precision-recall curve) for model comparison, because of the unbalanced nature of data [27, 28]. Namely, frequently used area under receiver operating curve (AUROC) is calculated based on true positive rate and false positive rate. True positive rate may be high even if recall is low (situation when predictor rarely predicts positive class), and thus it is often misleading in case of imbalanced data.
Because of relatively small number of samples (between 1000 and 1500 for each day), the predictive performance of models on unseen data was estimated by a fivefold cross-validation set created by stratified sampling, preserving the initial distribution of positive and negative classes of the target attribute. This validation on relatively small samples avoids the risk of misleading interpretation of the results caused by biased selection of a test set based on one sample.
3.3. Performance and feature selection
First, we present a comparison between feature selection methods, based on maximal predictive performance (in terms of AUPRC) overall algorithms. Next, we restrict further analyses on experiments with the overall best feature selection technique. Values in Table 2 illustrate the maximal predictive performance for each day and for each feature selection technique. Maximum values by days (rows) are shown in bold. It can be seen that Gini and ReliefF achieved maximum values on all days, except for the first day of ICU admission.
|Day/FS method||Info Gain Ratio||Gini||Correlation||ReliefF||T-test|
On the first day, the maximal performance is achieved with correlation (0.48). A more detailed inspection of the model performance and the number of features selected for each admission day demonstrated that Random Forest, result in the best predictive performance overall days (except the first day). Logistic regression often achieved a good performance. J4.8. achieved the worst AUPRC performance over all days, but in synergy with the AdaBoost ensemble scheme, it provided a competitive performance with Random Forest and Logistic Regression.
Further, Table 2 illustrates that predictive performance is increasing over days and stabilizes from day 4 to 7 on AUPRC = 0.86. AUPRC values for days 2 and 3 are also high (0.83 and 0.85, respectively) and illustrate that risk for hospital mortality can be predicted, with high confidence, starting from the second day of admission.
Values of area under precision-recall curve illustrate the general performance of predictive models but do not explain anything related to the selection of the actual thresholds that should be selected for predictions. Therefore, we analyzed possible thresholds by inspecting precision-recall (PR) trade-off. High recall means that most of the positive examples (in this case, hospital death) are predicted correctly. High precision means that there is a low number of false alarming (mortality is predicted, but the patient survived). Figure 2 shows precision/recall (PR) curves for first 4 days that were generated from the predictions of the best performing models (Table 3), built on features from the Gini selection.
|Day||J4.8||Logistic||Random Forest||AdaBoost (J 4.8)||AdaBoost (Logistic)||Bagging (J4.8)||Bagging (Logistic)|
|1||0.35 (10)||0.47 (60)||0.42 (30)||0.37 (10)||0.42 (30)||0.36 (20)||0.46 (30)|
|2||0.76 (60)||0.83 (50)||0.83 (70)||0.81 (30)||0.82 (60)||0.78 (50)||0.82 (60)|
|3||0.78 (30)||0.84 (20)||0.85 (40)||0.84 (60)||0.82 (90)||0.76 (10)||0.83 (20)|
|4||0.79 (20)||0.85 (20)||0.86 (40)||0.86 (10)||0.83 (20)||0.78 (10)||0.84 (10)|
|5||0.81 (20)||0.85 (30)||0.86 (80)||0.85 (20)||0.83 (20)||0.8 (10)||0.84 (10)|
|6||0.82 (20)||0.83 (40)||0.86 (70)||0.86 (20)||0.82 (90)||0.8 (20)||0.83 (10)|
|7||0.80 (10)||0.84 (30)||0.86 (80)||0.85 (10)||0.81 (30)||0.78 (10)||0.84 (30)|
On the first day, all models resulted in poor results which are found on the upper left PR curve in Figure 2.
Highest recall can be achieved with 0.3 precision (70% of false alarms), so this model is not useful, regardless of the threshold selection. On days 2–4, maximal recall can be achieved with around 20% of false alarms. Considering the cost of false negative predictions (low recall), we argue that optimal threshold values for day 2–4 models should be between 0.8 and 1 of recall.
Further, rankings of features provided by Gini feature selection methods illustrated in Table 3 demonstrate that different algorithms achieved best predictive accuracies with different numbers of selected features. The number of features selected varied over the days.
Finally, we analyzed if there was the difference, between features selected on the first day, where the predictive performance was consistently poor, and other days, where the predictive performance was acceptable. Rank means and standard deviations were calculated for two groups: all days (with day one) and days 2–7 (without day 1) (Figure 3). Standard deviations of ranks are much higher over ranks of features that include day 1 (right parts of Figure 3). The average ranks changed (middle part of the figure), but similar laboratory tests were in first 15 ranks in both cases.
This study using ensemble methods demonstrated an improvement in predictive accuracy compared to prediction based on single models. Random Forests seem to provide the best predictive accuracy complying with our previous research . (Table 3) Random Forest also resulted in a high predictive accuracy for mortality risk prediction . This study, however, did not analyze different ensemble and feature selection methods and was conducted on a different population. In addition, the feature selection techniques Gini Index and ReliefF scored best in the majority of the cases.
Laboratory tests ranked per day based on Random Forest and Gini Index. (mean, std.: standard deviation, len: number of tests/day) indicated the importance of mean lactate values (ranked first on day 1, 2, and 8), and mean white blood cells count (on day 3, 4 and 6) in the prediction of hospital mortality. In addition, it is fascinating to observe that predictive features for hospital mortality calculated from 755 laboratory related parameters and without any additional patient-related information or medical knowledge, correlated well with the laboratory tests used on a daily basis (sodium, anion gap, chloride, bicarbonate, creatinine, urea nitrogen, potassium, glucose, INR, hemoglobin, phosphate, total bilirubin, and base excess). Laboratory-based clinical decision support may improve physician adherence to guidelines with respect to timely monitoring of chronic kidney disease [30, 31].
As demonstrated in this study, parameters for shock (lactate), sepsis (white blood cells), and multi-organ failure are more important .
The ensemble models in this study were able to generate a high predictive accuracy (AUPRC values) from day 4, with acceptable results on the second and third day. On the first day of admission, however, AUPRC values were very low but correlated well with diagnostic uncertainty on the first day of admission.
Surprisingly, patients who survived hospital stay were significantly older but suffered less from lymphoma and metastatic cancer. These findings might indicate some admission bias for certain comorbidities or indicate a constitutional superiority of older people admitted to ICU despite renal failure.
The hospital mortality in this renal failure ICU population was 27.0% (399/1477) . Laboratory testing alone is only a part of the daily assessment of ICU patients. More research could elaborate predictive analysis of laboratory tests and other patient-related data in different patient populations.
Predictive analytics using ensemble methods are able to predict hospital or ICU outcome of renal patients with high accuracy. Predictive accuracy changes with the length of stay. Feature ranking enables quantitative assessment of patient data (e.g. laboratory tests) for predictive power. Lactate and white blood cell count best predict hospital mortality in this population. From the second day of ICU admission, predictive accuracy based on laboratory tests >80%. This generates opportunities for efficacy and efficiency analysis of other data recorded during ICU stay.
Licenses for RapidMiner/Radoop are provided by RapidMiner (Cambridge, MA, and USA).
Support for data extraction was kindly provided by Wessel Luijben from Anchormen (Amsterdam, The Netherlands). RapidMiner (RapidMinerAcademia) provided a free use of the commercial version of its platform to researchers and other academics at educational institutions.