In today’s world, data analytics has become an integral part of every domain, such as IoT, security, healthcare, and parallel systems. Its value lies in knowing which type of analytics to apply to which part of the data: depending on the nature and type of the data, the appropriate analytical technique varies. The type of analytics predominantly used in the health-care sector is survival analytics. The term originated in the medical domain, where it is used to estimate the survival rate of patients. Among all the types of data analytics, survival analytics is the one that depends entirely on the time until an event occurs. This chapter discusses the need for survival data analytics and explains the tools and techniques that support it. The impact of survival analytics on real-world problems is illustrated through case studies.
- data analytics
- survival analytics
- parametric models
1. Introduction to survival analytics
Survival analysis is a branch of statistics that evaluates the effect of predictors on the time until an event occurs, rather than on the probability that the event occurs. It is used to analyze data in which the time until the event is of interest. As the name indicates, the method has its origins in medical research, where it was used to evaluate the impact of medicines or medical treatment on the time until death. Survival analysis is also known as reliability analysis in engineering, duration analysis in economics, and event history analysis in sociology.
The term originated in a medical context, where it has been used to estimate the survival rate of patients. Data classification can be handled explicitly with the processes and paradigms available in survival analytical models. The process of survival analytics can be explored through various techniques such as:
Survivor and hazard function rates
Cox proportional hazards regression analysis
Parametric survival analytic models
Survival random forest
2. Metrics for measurement in developing survival models
The process of survival analytics depends mainly on the time and occurrence of the event. Survival analytics also considers time-varying covariates, that is, variables whose values change over time up to the occurrence of the event. The process can be characterized using the following measures:
Event time distribution
The event time distribution describes how likely the event is to occur around a given time. For a continuous event time $T$, it is defined in Eq. (1) as

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t} \tag{1}$$

where $T$ is the time at which the event occurs and $\Delta t$ is a small change in time $t$.
The function $f(t)$ therefore measures the probability per unit time that the event happens in a short interval starting at $t$.
Cumulative event time distribution
The cumulative event time distribution for the event time density $f(t)$ is defined in Eq. (2) as

$$F(t) = P(T \le t) = \int_{0}^{t} f(u)\,du \tag{2}$$

The time period runs from 0 to $t$.
Survival function
The survival function gives the probability that the subject survives beyond time $t$. It is also called the survivor function or reliability function. Writing $F(t)$ for the cumulative event time distribution of Eq. (2), it is given in Eq. (3) as

$$S(t) = 1 - F(t) = P(T > t) = \int_{t}^{\infty} f(u)\,du \tag{3}$$

With the boundary conditions $S(0) = 1$ and $S(\infty) = 0$, the following relationship holds:

$$f(t) = \frac{dF(t)}{dt} = -\frac{dS(t)}{dt} \tag{4}$$
Hazard function
The hazard function, also called the hazard rate or the force of mortality, is the ratio of the probability density function to the survivor function, as given in Eq. (5):

$$h(t) = \frac{f(t)}{S(t)} \tag{5}$$

where $h(t)$ is the hazard function, $f(t)$ is the probability density function, and $S(t)$ is the survival function.
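The relationships among the density, the cumulative distribution, the survival function, and the hazard can be checked numerically. The following is a minimal sketch, assuming a hypothetical exponential density with rate 0.5 purely for illustration; the function names and the trapezoidal integration are choices made here, not part of the chapter.

```python
import math

def density(t, rate=0.5):
    """Hypothetical event time density f(t); an exponential is assumed here."""
    return rate * math.exp(-rate * t)

def cumulative(t, steps=10_000):
    """F(t): integrate f from 0 to t with the trapezoidal rule."""
    if t == 0:
        return 0.0
    width = t / steps
    total = 0.5 * (density(0.0) + density(t))
    for i in range(1, steps):
        total += density(i * width)
    return total * width

def survival(t):
    """S(t) = 1 - F(t): probability of surviving beyond t."""
    return 1.0 - cumulative(t)

def hazard(t):
    """h(t) = f(t) / S(t): ratio of density to survival function."""
    return density(t) / survival(t)

if __name__ == "__main__":
    for t in (0.0, 1.0, 2.0, 4.0):
        print(t, round(cumulative(t), 4), round(survival(t), 4), round(hazard(t), 4))
```

For this assumed density, the numerically integrated $F(t)$ agrees with the closed form $1 - e^{-0.5t}$, and the hazard stays at the chosen rate for every $t$.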
3. Model classification in survival analytics
The process behind survival analytics differs from that of predictive and descriptive analytics. Here, the time component is an important factor that determines the success or failure of a model. Figure 1 illustrates the classification of models in survival analytics. Different functions suit different models, depending on the metric used, with time to the event as a component. The main goal is to choose the right model for the observed survival data. In parametric analysis, the survival curve is determined entirely by the assumed distribution and its parameter values.
The appropriate shape can be judged against the characteristics of a nonparametric model. As a result, the shape of the hazard function also varies over time. Examples of hazard shapes include:
Convex bathtub-shaped hazard
Parametric survival analysis can be understood through the distributions it assumes. Each distribution implies a particular probability curve over time. The distributions used here to model the survival curve are the exponential, Weibull, and log-logistic distributions, each of which is described in detail in the next section.
4. Parametric survival analytical models
4.1. Exponential distribution
The exponential distribution is also known as the negative exponential distribution. It describes a process in which events occur continuously and independently at a constant average rate $\lambda > 0$. The exponential density is defined as

$$f(t) = \lambda e^{-\lambda t}$$

The survivor function is then

$$S(t) = e^{-\lambda t}$$

The hazard rate is then

$$h(t) = \frac{f(t)}{S(t)} = \lambda \tag{9}$$

Hence, from Eq. (9), the hazard rate is independent of time, so the risk of the event remains the same throughout. Figures 2 and 3 illustrate the event time with respect to the hazard rate.
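The constant hazard of the exponential model can be confirmed with a few lines of code. This is a small sketch under the standard form of the exponential density with rate `lam`; the rate value 0.2 and the function names are hypothetical and chosen only for illustration.

```python
import math

def exp_density(t, lam):
    """f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)

def exp_survival(t, lam):
    """S(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def exp_hazard(t, lam):
    """h(t) = f(t) / S(t), which simplifies to lam."""
    return exp_density(t, lam) / exp_survival(t, lam)

lam = 0.2  # hypothetical rate: 0.2 events per unit of time
hazards = [exp_hazard(t, lam) for t in (0.5, 1.0, 5.0, 20.0)]

# Median survival time: the t at which S(t) = 0.5, i.e. ln(2) / lam.
median_time = math.log(2) / lam

if __name__ == "__main__":
    print(hazards)      # every entry equals lam up to rounding error
    print(median_time)
```

Because the hazard never changes, a subject who has already survived a long time faces the same risk in the next instant as a new subject, which is why the exponential model is often too restrictive for clinical data.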
4.2. Weibull distribution
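The Weibull distribution is a continuous probability distribution that generalizes the exponential distribution. In one common parameterization, with scale parameter $\rho > 0$ and shape parameter $\kappa > 0$, the density is defined in Eq. (10) as

$$f(t) = \kappa \rho (\rho t)^{\kappa - 1} e^{-(\rho t)^{\kappa}} \tag{10}$$

The hazard rate is given as

$$h(t) = \kappa \rho (\rho t)^{\kappa - 1}$$

The hazard therefore increases over time for $\kappa > 1$, decreases for $\kappa < 1$, and reduces to the constant exponential hazard $\rho$ for $\kappa = 1$.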
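How the Weibull shape parameter changes the hazard can be sketched as follows. The parameterization is an assumption made for this example: survival function $S(t) = e^{-(\rho t)^{\kappa}}$ with scale `rho` and shape `kappa`, so that the hazard is $\kappa \rho (\rho t)^{\kappa - 1}$. The parameter values are hypothetical.

```python
import math

def weibull_survival(t, rho, kappa):
    """S(t) = exp(-(rho * t) ** kappa) under the assumed parameterization."""
    return math.exp(-((rho * t) ** kappa))

def weibull_density(t, rho, kappa):
    """f(t) = kappa * rho * (rho * t) ** (kappa - 1) * S(t)."""
    return kappa * rho * (rho * t) ** (kappa - 1) * weibull_survival(t, rho, kappa)

def weibull_hazard(t, rho, kappa):
    """h(t) = f(t) / S(t) = kappa * rho * (rho * t) ** (kappa - 1)."""
    return kappa * rho * (rho * t) ** (kappa - 1)

times = (0.5, 1.0, 2.0, 4.0)
rho = 0.5  # hypothetical scale parameter
decreasing = [weibull_hazard(t, rho, 0.5) for t in times]  # kappa < 1
constant = [weibull_hazard(t, rho, 1.0) for t in times]    # kappa = 1
increasing = [weibull_hazard(t, rho, 2.0) for t in times]  # kappa > 1

if __name__ == "__main__":
    print(decreasing, constant, increasing)
```

With `kappa = 1` the model collapses to the exponential case with rate `rho`, which gives a quick sanity check when fitting both models to the same data.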
4.3. Log-logistic distribution
The log-logistic distribution is a continuous probability distribution for a nonnegative random variable. In economics, it is also known as the Fisk distribution. Figures 6 and 7 depict the log-logistic model.
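The chapter does not state the log-logistic formulas, so the sketch below assumes the standard form with scale `alpha` and shape `beta`, for which $S(t) = 1 / (1 + (t/\alpha)^{\beta})$. Under that assumption the hazard is not monotone when `beta > 1`: it rises to a peak and then falls, which distinguishes it from the exponential and Weibull models. All names and values are hypothetical.

```python
def loglogistic_survival(t, alpha, beta):
    """S(t) = 1 / (1 + (t / alpha) ** beta), standard form assumed."""
    return 1.0 / (1.0 + (t / alpha) ** beta)

def loglogistic_density(t, alpha, beta):
    """f(t) = (beta / alpha) * (t / alpha) ** (beta - 1) / (1 + (t / alpha) ** beta) ** 2."""
    ratio = t / alpha
    return (beta / alpha) * ratio ** (beta - 1) / (1.0 + ratio ** beta) ** 2

def loglogistic_hazard(t, alpha, beta):
    """h(t) = f(t) / S(t)."""
    return loglogistic_density(t, alpha, beta) / loglogistic_survival(t, alpha, beta)

alpha, beta = 1.0, 2.0  # hypothetical scale and shape
hazard_curve = [loglogistic_hazard(t, alpha, beta) for t in (0.5, 1.0, 2.0)]

if __name__ == "__main__":
    print(hazard_curve)
    print(loglogistic_survival(alpha, alpha, beta))  # the scale alpha is the median
```

With these values the hazard is $2t / (1 + t^2)$, which peaks at $t = 1$, and the scale parameter equals the median survival time because $S(\alpha) = 0.5$.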
5. Kaplan-Meier analytics and Cox regression model
The Kaplan-Meier test is widely used in the pharmaceutical industry and the health-care sector, for example to specify the expiry date of clinical drugs and to monitor the effect of drugs on recovery time or on the time to a critical event. It is a statistical method that is commonly applied when evaluating cancer treatments. The test compares patient survival times between two groups. In a clinical drug trial, a successful test indicates that the group of people taking the new drug has a shorter time to improvement, or a longer time to death, than the group of people taking a placebo.
If there is no censoring, the KM estimator of $S(t)$ is simply the proportion of the sample with event times greater than $t$. If censoring is present, the following steps are followed:
Step 1: Order the event times in ascending order, $t_1 < t_2 < \dots < t_k$.
Step 2: At each time $t_j$, there are $n_j$ individuals who are at risk of the event.
Step 3: Let $d_j$ be the number of individuals who die (churn, respond) at $t_j$.
Therefore, the KM estimator is defined as

$$\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$$

If there are many unique event times, the KM estimator can be adjusted using the life-table (actuarial) method, which groups the event times into intervals:

$$\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j - c_j/2}\right)$$

where
$n_j$ is the number of individuals at risk at the start of interval $j$,
$d_j$ is the number of individuals who die during the interval, and
$c_j$ is the number of individuals censored during the interval.
The KM estimator can be complemented by statistical hypothesis tests such as the Wilcoxon test and the likelihood ratio test. Combined with exploratory data analysis, the KM estimator helps reveal patterns and insights in the data more efficiently.
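The three steps of the KM estimator translate directly into code. The following is a minimal sketch in plain Python, assuming each subject is recorded as an observed time plus a flag that is true when the event was observed and false when the subject was censored; the function name and the sample data are hypothetical.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_j, S_hat(t_j))] for each distinct event time.

    times:  observed time for each subject (event or censoring time)
    events: True if the event was observed, False if the subject was censored
    """
    deaths = Counter(t for t, observed in zip(times, events) if observed)
    survival = 1.0
    curve = []
    for t in sorted(deaths):                        # Step 1: ascending event times
        at_risk = sum(1 for u in times if u >= t)   # Step 2: n_j, still under observation
        survival *= 1.0 - deaths[t] / at_risk       # Step 3: multiply by (1 - d_j / n_j)
        curve.append((t, survival))
    return curve

# Hypothetical data: six subjects, two of them censored (at times 3 and 6).
example_curve = kaplan_meier(
    [2, 3, 3, 5, 6, 8],
    [True, True, False, True, False, True],
)

if __name__ == "__main__":
    for t, s in example_curve:
        print(t, round(s, 4))
```

Censored subjects never trigger a drop in the curve, but they do count in the risk set up to their censoring time. When no subject is censored, the estimate at each time equals the plain proportion of subjects still alive.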
The second popular survival analysis method used for prediction is the Cox model, often referred to as Cox regression. It is widely used: more than 38,000 articles indexed in the Web of Science cite the Cox regression method.
Other statistical and analytical methods can predict the time until an event, but survival analysis methods have the unique feature of including cases in which the event has not yet been observed. Although these cases do not have a date for the target event, they are an integral part of the analysis. In survival analysis terminology, they are called censored cases.
More formally, survival analysis is a set of methods for analyzing data in which the outcome variable is the time until the occurrence of an event of interest. The event can be unplanned (an accident, death, or the occurrence of a disease) or planned (marriage, divorce, etc.). The time to event, or survival time, can be measured in various units (days, weeks, years, etc.).
For example, if the event of interest is a mild heart attack, then the survival time can be the time in years until a person has a heart attack, and any of the survival methods discussed above can be applied. In survival analysis, time is the primary factor. The advantage of Cox regression over Kaplan-Meier is that it can accommodate any number of predictors (for example, factors affecting the chance of a heart attack) rather than group membership only. As with all regression methods, analysis using Cox regression offers two potential benefits: predictor ranking, in which each predictor’s effect is assessed as greater than or less than a threshold effect, and the ability to make predictions from the regression results. Predictor rankings help the analyst recognize the factors that have the most influence on the time to an event, and the regression results can be used to estimate the amount of time until an event for a specific subject profile.
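The two benefits of Cox regression, predictor ranking and prediction for a profile, can be illustrated with a small sketch. It assumes the standard proportional hazards form, in which the hazard of a subject is a baseline hazard multiplied by the exponential of a weighted sum of predictors. The chapter does not state this formula, and the coefficients below are hypothetical values, not fitted results.

```python
import math

# Hypothetical fitted coefficients (log hazard ratios), for illustration only.
coefficients = {"age_decades": 0.4, "smoker": 0.7}

def relative_hazard(coefficients, profile):
    """exp(sum of coefficient * value): hazard relative to the baseline subject."""
    return math.exp(sum(coefficients[name] * value for name, value in profile.items()))

smoker = {"age_decades": 5, "smoker": 1}
non_smoker = {"age_decades": 5, "smoker": 0}

# Hazard ratio between two profiles that differ only in smoking status.
hazard_ratio = relative_hazard(coefficients, smoker) / relative_hazard(coefficients, non_smoker)

# Ranking by absolute coefficient; meaningful only if predictors are on comparable scales.
ranking = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)

if __name__ == "__main__":
    print(hazard_ratio, ranking)
```

Under these assumed coefficients, the ratio depends only on the predictor that differs between the two profiles, so the baseline hazard never needs to be specified for the comparison.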
5.1. Different types of censoring
Data can be right, left, or interval censored. Suppose observation of a subject starts at a defined time t0 and the event of interest takes place at t0 + t, where t is unknown. The data are censored, with censoring time c, when the event time is known only in relation to t0 + c rather than exactly.
Right censoring is the most common type. It occurs when the true event time is greater than the censoring time, that is, when c < t. It often arises when the event of interest has not occurred by the end of the study or when the subject has been lost to follow-up.
Left censoring is the opposite. It occurs when the true event time is less than the censoring time, that is, when c > t.
Interval censoring combines left and right censoring: the event is known to have occurred between two time points, c1 < t < c2.
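The three censoring types can be expressed as bounds on the unknown event time. The sketch below is one hypothetical way to encode them: `None` stands for an unknown bound, and the helper `right_censor` shows how a right-censored record is usually stored as an observed time plus an event flag. The names are illustrative only.

```python
def censoring_type(lower, upper):
    """Classify an observation from the known bounds on the event time t.

    lower: latest time at which the event is known not to have happened yet, or None
    upper: earliest time by which the event is known to have happened, or None
    """
    if lower is not None and lower == upper:
        return "exact"      # event time observed directly
    if upper is None:
        return "right"      # c < t: event happens after the censoring time
    if lower is None:
        return "left"       # c > t: event happened before the censoring time
    return "interval"       # c1 < t < c2

def right_censor(true_time, censor_time):
    """Return (observed_time, event_observed) for a study that stops at censor_time."""
    if true_time <= censor_time:
        return true_time, True
    return censor_time, False

if __name__ == "__main__":
    print(censoring_type(5, None), censoring_type(None, 3), censoring_type(2, 4))
    print(right_censor(7.5, 5.0))
```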
Censoring is an important issue in survival analysis, representing a particular type of missing data. Censoring that is random and non-informative is usually required in order to avoid bias in a survival analysis. The interpretation of Cox regression and Kaplan-Meier results depends on whether the event of interest is positive (e.g., a sale) or negative (e.g., product failure).
6. Case study for churn prediction
The following graphical illustrations depict the implementation of churn prediction and model deployment using RapidMiner. The algorithm used for the analysis is the decision tree. The model is evaluated with lift chart analysis and performance metrics. The attributes in the dataset include person ID, churn status, gender, age, region code, transaction count, average balance, and total accounts.
This case study explores churn analysis, one of the most common applications. Churn prediction is now used in most industries to learn from the history of customer behavior. Customer demographic data are analyzed day to day with regard to the maintenance of business relationships, customer transactions, products purchased, and surveys of what attracts customers to the business. To explore this application, we used the RapidMiner tool for the entire estimation and analysis of the survival rate of customers in an organization. Figure 8 shows the selection of the churn prediction application for estimating the survival rate of customers.
RapidMiner is a statistical and analytical tool that is widely used in industry and academic institutions. It helps statisticians and mathematical experts observe the insights and patterns that lie within the given data. Figure 9 shows the analytical results obtained with RapidMiner.
All components in RapidMiner are connected by wiring processes together. The workflow of each process is implemented in Java. The process diagram depicts the step-by-step flow of model development, built with drag-and-drop operations. Figure 10 provides a complete overview of the process created for churn prediction analysis. The algorithm used to develop the model is the decision tree classification algorithm. It builds a tree-like structure in a top-down fashion, with a single root node and a number of leaf nodes, and stops when a terminating condition is met. The behavior of the algorithm depends on the splitting criterion used for the analysis (Figures 11 and 12).
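The role of the splitting criterion can be illustrated outside RapidMiner. The text does not say which criterion was used, so the sketch below assumes information gain (entropy reduction) as one common choice. The four customer records are invented for illustration and only borrow attribute names from the dataset description.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="churn"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical records: region separates churners perfectly, gender not at all.
rows = [
    {"gender": "F", "region": "A", "churn": "yes"},
    {"gender": "M", "region": "A", "churn": "yes"},
    {"gender": "F", "region": "B", "churn": "no"},
    {"gender": "M", "region": "B", "churn": "no"},
]

gains = {attr: information_gain(rows, attr) for attr in ("gender", "region")}
best_attribute = max(gains, key=gains.get)

if __name__ == "__main__":
    print(gains, best_attribute)
```

A top-down tree builder would place the attribute with the highest gain at the root and then repeat the same calculation on each resulting subset until the terminating condition is met.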
A lift chart shows the effectiveness of the developed predictive model. It generally provides the ratio of the predicted values to the actual ones. In Figure 13, the chart for churn analysis relates the confidence value to the count observed. Thus, churn prediction is employed for tracking the survival rate of customers with survival analytics. Survival analytics models can likewise be deployed for tracking the survival rate of patients in the medical domain. Health informatics is centrally concerned with the survival of subjects who have a specific disease. Whether subjects with a specific disease survive or are lost can be learned from the patterns explored through survival analytics models.
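The lift idea mentioned at the start of the preceding paragraph can be made concrete with a short calculation. The sketch assumes one common definition of lift, namely the churn rate among the highest-scored customers divided by the overall churn rate; this may differ from what the RapidMiner chart in Figure 13 plots. The scores and labels are invented.

```python
def lift_at(scores, labels, fraction):
    """Lift in the top `fraction` of customers ranked by model confidence.

    scores: model confidence that each customer churns
    labels: 1 if the customer actually churned, 0 otherwise
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Hypothetical confidences and outcomes for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    for fraction in (0.2, 0.5, 1.0):
        print(fraction, round(lift_at(scores, labels, fraction), 2))
```

A lift above 1 in the top segment means the model concentrates churners there better than random selection would, and the lift over the whole population is always 1.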
7. Case study using Kaplan-Meier analytics
Consider the records of 200 patients who have been tracked over a period of time. The number of patients is confirmed at the end of the first year, the second year, the third year, and so on. If all subjects survived for the whole duration, there would be no need to estimate a probability of occurrence for any subject. To illustrate the more complicated situation, consider the scenario in Table 1.
|Time period||At risk||Become unavailable (censored)||Died||Survived|
Out of 200 subjects, 6 became unavailable and 10 have been found to be dead at the end of the first year.
With the remaining subjects, 6 became unavailable and 20 have been found to be dead at the end of the second year.
With the remaining subjects, 6 became unavailable and 30 have been found to be dead at the end of the third year.
With the remaining subjects, 6 became unavailable and 40 have been found to be dead at the end of the fourth year.
With the remaining subjects, 6 became unavailable and 50 have been found to be dead at the end of the fifth year.
For this scenario, we can list the individuals who became unavailable by the end of each time period. Kaplan-Meier analytics is used to determine the number of individuals at risk in each period and the estimated probability that an individual survives to the end of the fifth year.
Step 1: Kaplan and Meier suggested that subjects who become unavailable during a given time period be counted among those who survive to the end of that period, but be removed from the number of individuals at risk in the next time period. With these conventions, the computation is laid out in Table 2.
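|Time period||At risk||Become unavailable (censored)||Died||Survived|
|Year 1||200||6||10||190|
|Year 2||184||6||20||164|
|Year 3||158||6||30||128|
|Year 4||122||6||40||82|
|Year 5||76||6||50||26|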
Hence, Table 2 shows that at the end of the fifth year, 26 of the 200 individuals with the specified disease have survived. The next step is to determine the Kaplan-Meier probability estimate for each time interval as a product of conditional probabilities. Table 3 provides the probability estimates for the 5 years of risk analysis.
|Time period||At risk||Become unavailable (censored)||Died||Survived||Kaplan-Meier survival probability estimate|
|Year 1||200||6||10||190||(190/200) = 0.95|
|Year 2||184||6||20||164||(190/200) × (164/184) = 0.85|
|Year 3||158||6||30||128||(190/200) × (164/184) × (128/158) = 0.69|
|Year 4||122||6||40||82||(190/200) × (164/184) × (128/158) × (82/122) = 0.46|
|Year 5||76||6||50||26||(190/200) × (164/184) × (128/158) × (82/122) × (26/76) = 0.16|
From Table 3, the Kaplan-Meier survival probability estimate at the end of the fifth year is about 0.16, that is, roughly 16% of individuals are estimated to survive the full five years. Hence, from the survival probability estimates, we can determine the survival rate of individuals for any given time period.
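The at-risk counts and the running product in the case study can be recomputed mechanically. The following sketch uses only the figures given in the scenario (200 subjects, 6 censored per year, and 10, 20, 30, 40, and 50 deaths); the function and variable names are hypothetical.

```python
# (label, censored during the year, died during the year), as stated in the scenario
periods = [
    ("Year 1", 6, 10),
    ("Year 2", 6, 20),
    ("Year 3", 6, 30),
    ("Year 4", 6, 40),
    ("Year 5", 6, 50),
]

def km_table(initial_at_risk, periods):
    """Build rows of (label, at_risk, censored, died, survived, km_estimate)."""
    rows = []
    at_risk = initial_at_risk
    estimate = 1.0
    for label, censored, died in periods:
        survived = at_risk - died            # censored subjects count as survivors of this period
        estimate *= survived / at_risk       # conditional survival probability for this period
        rows.append((label, at_risk, censored, died, survived, estimate))
        at_risk = survived - censored        # censored subjects leave the risk set afterwards
    return rows

table = km_table(200, periods)

if __name__ == "__main__":
    for label, at_risk, censored, died, survived, estimate in table:
        print(label, at_risk, censored, died, survived, round(estimate, 2))
```

Keeping the running product at full precision and rounding only for display avoids compounding rounding errors across the five factors.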