An Introduction to Survival Analytics, Types, and Its Applications

Data analytics has become an integral part of domains such as IoT, security, healthcare, and parallel systems. Its value depends on choosing the right type of analytics for the data at hand, and the appropriate technique varies with the nature and type of the data. One type used predominantly in the health-care sector is survival analytics. The term originated in the medical domain, where it is used to estimate the survival rate of patients. Among the types of data analytics, survival analytics is the one that depends entirely on time and the occurrence of an event. This chapter discusses the need for survival data analytics and explains the tools and techniques used for it. The impact of survival analytics on real-world problems is also illustrated through case studies.


Introduction to survival analytics
Survival analysis is a branch of statistics that evaluates the effect of predictors on the time until an event occurs, rather than on the probability that the event occurs. It is used to analyze data in which the time until the event is of interest. As the name indicates, the method has its origins in medical research, where it was used to evaluate the impact of medicines or medical treatment on the time until death. Survival analysis is also known as reliability analysis in engineering, duration analysis in economics, and event history analysis in sociology.
Data classification can be handled explicitly with the processes and paradigms available in survival analytical models [1]. The process of survival analytics can be explored through the techniques described in the following sections.

Metrics for measurement in developing survival models
Survival analytics depends mainly on time and the occurrence of the event. Time-varying covariates are variables whose values change over the period leading up to the event [2]. Survival analytics can be characterized and measured using the following quantities:

Event time distribution
The event time distribution is the probability density of the event time, viewed as a function of time t. It is defined in Eq. (1) as
$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t} \tag{1}$$
where T is the time at which the event occurs and Δt is a small change in time t.
The density f(t) therefore describes how likely the event is to happen within a small interval of time around t.

Cumulative event time distribution
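The cumulative event time distribution for the density f(t) is defined in Eq. (2) as
$$F(t) = P(T \le t) = \int_0^t f(u)\,du \tag{2}$$
where T denotes the event time. It gives the probability that the event occurs in the time period from 0 to t.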

Survival function
The survival function gives the probability that the subject survives beyond time t. It is also called the survivor function or the reliability function. It is defined in Eq. (3) as
$$S(t) = 1 - F(t) = P(T > t) = \int_t^\infty f(u)\,du \tag{3}$$
where T denotes the event time and F(t) is the cumulative event time distribution. With S(0) = 1 and S(∞) = 0, the following relationship holds:
$$f(t) = \frac{dF(t)}{dt} = -\frac{dS(t)}{dt} \tag{4}$$
An Introduction to Survival Analytics, Types, and Its Applications. DOI: http://dx.doi.org/10.5772/intechopen.80953

Hazard function
The hazard function, also called the hazard rate or the force of mortality, is the ratio of the probability density function to the survivor function, as given in Eq. (5):
$$h(t) = \frac{f(t)}{S(t)} \tag{5}$$
where h(t) is the hazard function, f(t) is the probability density function, and S(t) is the survival function.
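A short derivation, added here as an illustration rather than taken from the chapter, shows how the hazard function alone determines the survival function. It uses only the ratio definition of the hazard, the fact that the density is the negative derivative of the survival function, and S(0) = 1. The cumulative hazard symbol H(t) is notation introduced for this sketch.

$$h(t) = \frac{f(t)}{S(t)} = -\frac{1}{S(t)}\frac{dS(t)}{dt} = -\frac{d}{dt}\ln S(t)$$

$$H(t) = \int_0^t h(u)\,du = -\ln S(t) \quad\Longrightarrow\quad S(t) = e^{-H(t)}$$

Specifying any one of f(t), F(t), S(t), or h(t) is therefore enough to recover the other three.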

Model classification in survival analytics
Survival analytics differs from predictive and descriptive analytics [3] in that the time component is an important factor that determines the success or failure of a model. Figure 1 illustrates the classification of models under survival analytics [4]. Different functions suit different models, depending on the metric used with time as the component for the event to occur. The main goal is to choose the right model for the observed survival data. In parametric model analysis, the survival curve depends only on the assumed shape of the model and its parameter values.
The shape of the model can be assessed against the characteristics of a nonparametric model. As a result, the shape of the hazard function may also vary with time. Examples of hazard shapes include constant, increasing, and decreasing hazards. Parametric survival analysis can be understood through the available forms of distributions, which describe the probability curve as a function of time. The distributions used to estimate the survival curve are the exponential, Weibull, and log-logistic distributions, each of which is described below.

Exponential distribution
The exponential distribution, also known as the negative exponential distribution, describes the time between events in a process in which events occur continuously and independently at a constant average rate λ. Its density is defined as
$$f(t) = \lambda e^{-\lambda t}$$
The survivor function is then
$$S(t) = e^{-\lambda t}$$
and the hazard rate is
$$h(t) = \frac{f(t)}{S(t)} = \lambda \tag{9}$$
Hence, from Eq. (9), the hazard rate is independent of time, and the risk of the event therefore remains the same throughout. Figures 2 and 3 illustrate the event time with respect to the hazard rate.
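One consequence of a time-independent hazard is that the exponential distribution is memoryless: the probability of surviving a further period does not depend on how long the subject has already survived. The sketch below checks this numerically. The rate value and the function names are assumptions chosen for illustration only.

```python
import math


def exp_survival(t, rate):
    """Survivor function of the exponential distribution, S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def conditional_survival(elapsed, extra, rate):
    """P(T > elapsed + extra | T > elapsed) = S(elapsed + extra) / S(elapsed)."""
    return exp_survival(elapsed + extra, rate) / exp_survival(elapsed, rate)


rate = 0.2  # assumed constant hazard (events per unit time)
for elapsed in (0.0, 5.0, 20.0):
    # The printed probability is the same for every value of `elapsed`.
    print(elapsed, round(conditional_survival(elapsed, 3.0, rate), 6))
```

Whatever the elapsed time, the chance of surviving three more time units equals the unconditional S(3), which is what a constant risk means in practice.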

Weibull distribution
The Weibull distribution is a continuous probability distribution. With rate parameter λ and shape parameter κ, its density is defined in Eq. (10) as
$$f(t) = \kappa \lambda (\lambda t)^{\kappa - 1} e^{-(\lambda t)^{\kappa}} \tag{10}$$
The hazard rate is given as
$$h(t) = \kappa \lambda (\lambda t)^{\kappa - 1}$$
Hence, in this case the hazard rate depends on time: it increases with time when κ > 1 and decreases when κ < 1. Figures 4 and 5 depict the hazard rate with respect to time t.
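The dependence of the Weibull hazard on time can be seen by evaluating it for a few shape values. This is a minimal sketch assuming the common parameterization h(t) = κλ(λt)^(κ−1), with `rate` standing for λ and `shape` for κ. The parameter names and values are illustrative assumptions, not taken from the chapter.

```python
def weibull_hazard(t, rate, shape):
    """Weibull hazard h(t) = shape * rate * (rate * t) ** (shape - 1), for t > 0."""
    return shape * rate * (rate * t) ** (shape - 1)


times = [0.5, 1.0, 2.0, 4.0]
for shape in (0.5, 1.0, 2.0):
    # shape < 1: decreasing hazard; shape == 1: constant; shape > 1: increasing.
    values = [weibull_hazard(t, rate=1.0, shape=shape) for t in times]
    print(shape, [round(v, 3) for v in values])
```

With shape equal to 1 the hazard reduces to the constant rate, which is the exponential case.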

Log-logistic distribution
The log-logistic distribution is a continuous probability distribution for a nonnegative random variable. In economics, it is also known as the Fisk distribution. Figures 6 and 7 depict the log-logistic model.
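The chapter gives no formulas for this distribution, so the sketch below uses the standard log-logistic form with scale α and shape β, for which S(t) = 1 / (1 + (t/α)^β) and h(t) = (β/α)(t/α)^(β−1) / (1 + (t/α)^β). The symbols, function names, and parameter values are assumptions made for illustration.

```python
def loglogistic_survival(t, scale, shape):
    """S(t) = 1 / (1 + (t / scale) ** shape)."""
    return 1.0 / (1.0 + (t / scale) ** shape)


def loglogistic_hazard(t, scale, shape):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1) / (1 + (t / scale) ** shape)."""
    ratio = t / scale
    return (shape / scale) * ratio ** (shape - 1) / (1.0 + ratio ** shape)


# With shape > 1 the hazard first rises and then falls.
for t in (0.5, 1.0, 2.0, 4.0):
    print(t, round(loglogistic_survival(t, 1.0, 2.0), 4), round(loglogistic_hazard(t, 1.0, 2.0), 4))
```

Unlike the Weibull hazard, which is monotone, the log-logistic hazard can rise to a peak and then decline, which suits events whose risk is highest some time after the start of observation.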

Kaplan-Meier analytics and Cox regression model
The Kaplan-Meier method is widely used in the pharmaceutical industry and the health-care sector, for example to specify the expiry date of clinical drugs and to monitor the effect of drugs on recovery time or critical time. It is a statistical method that works well for assessing the effectiveness of cancer treatments. The method compares patient survival times between two groups. In a clinical drug trial, a successful result indicates that the group taking the new drug has a shorter time to improvement, or a longer time to death, than the group taking a placebo [5].
If there is no censoring, the KM estimator of S(t) is simply the proportion of individuals whose event times are greater than t. If censoring is present, the following steps are followed:
Step 1: Order the event times in ascending order, t_1 < t_2 < t_3 < ⋯ < t_k.
Step 2: At each time t_j, determine the number n_j of individuals who are at risk of the event.
Step 3: Let d_j be the number of individuals who die (or churn, or respond) at t_j.
The KM estimator is then defined as
$$\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$$
If there are many unique event times, the event times can be grouped into intervals and the estimator computed with the life-table method as
$$\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j - c_j/2}\right)$$
where n_j is the number of individuals at risk, d_j is the number of individuals who die at the specified time, and c_j is the number of individuals censored in interval j. The KM estimator can be complemented with statistical hypothesis tests such as the Wilcoxon test and the likelihood ratio test. Combined with exploratory data analytics, the KM estimator allows patterns and insights to be determined more efficiently from the data.
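The three steps can be sketched in a few lines of Python. This is an illustrative implementation of the product-limit calculation, not code from the chapter. The function name, the input format (one follow-up time and one event flag per subject), and the sample data are assumptions.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t_j, n_j, d_j, S_hat(t_j))] for each distinct event time.

    durations: follow-up time of each subject
    observed:  True if the event occurred at that time, False if censored
    """
    deaths = Counter(t for t, event in zip(durations, observed) if event)
    exits = Counter(durations)  # events and censorings that leave the risk set
    at_risk = len(durations)
    survival = 1.0
    table = []
    for t in sorted(exits):  # Step 1: ascending order of times
        d = deaths[t]  # Step 3: number of events at t_j
        if d:
            survival *= 1.0 - d / at_risk  # at_risk is n_j from Step 2
            table.append((t, at_risk, d, survival))
        at_risk -= exits[t]  # drop everyone who left at time t
    return table


# Hypothetical data: six subjects, two of them censored.
durations = [2, 3, 3, 5, 6, 8]
observed = [True, True, False, True, False, True]
for t, n, d, s in kaplan_meier(durations, observed):
    print(t, n, d, round(s, 4))
```

Censored subjects contribute to the risk set up to their censoring time and are then removed without changing the estimate, which is how the estimator uses incomplete records instead of discarding them.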
The second popular survival analysis method used for prediction is the Cox model, often referred to as Cox regression. It is widely cited: more than 38,000 articles indexed in the Web of Science cite the Cox regression method.
Other statistical and analytical methods can predict the time until an event, but survival analysis methods have the distinctive feature of also using cases whose history is known but for which the event has not yet been observed. Although these cases do not have a date for the target event, they are an integral part of the analysis. In survival analysis terminology, they are called censored cases.
More formally, survival analysis is a set of methods for analyzing data in which the outcome variable is the time until the occurrence of an event of interest. The event can be unplanned (an accident, death, or the occurrence of a disease) or planned (marriage, divorce, etc.). The time to event, or survival time, can be measured on various time scales (days, weeks, years, etc.).
For example, if the event of interest is a mild heart attack, the survival time can be the time in years until a person has a heart attack. Any of the survival methods discussed above can be applied; in each of them, time is the primary factor. The advantage of Cox regression over Kaplan-Meier is that it can accommodate any number of predictors (for example, factors affecting the chance of a heart attack), rather than group membership only. As with all regression methods, analysis using Cox regression has two potential benefits: predictor ranking, in which each predictor's effect is measured as greater or less than a threshold effect, and the ability to make predictions from the regression results. Predictor rankings help the analyst recognize the factors that have the most influence on the time to an event, and the regression results can be used to estimate the amount of time until an event for a specific subject profile [6].
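The chapter does not state the model equation, so the following is the standard proportional hazards form of Cox regression, given as a sketch. The symbols h_0, x_k, and β_k are notation assumed here.

$$h(t \mid x_1, \dots, x_p) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p)$$

$$\frac{h(t \mid x_k + 1)}{h(t \mid x_k)} = e^{\beta_k}$$

Here h_0(t) is the baseline hazard shared by all subjects, and each coefficient β_k scales the hazard multiplicatively. A hazard ratio e^{β_k} above 1 means the predictor raises the risk and so shortens the expected time to the event, while a value below 1 lowers the risk; comparing the predictors by these ratios is one way to rank their influence.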

Different types of censoring
Data can be right, left, or interval censored. Suppose a subject is observed from a defined time t_0 and the event of interest takes place at t_0 + t, where t is unknown. The observation is censored, with censoring time c, when the event time is known only in relation to t_0 + c rather than exactly.
Right censoring is the most common type. It occurs when the true event time is greater than the censoring time, that is, when c < t. It often arises when the event of interest has not occurred by the end of the study or when the subject has been lost to follow-up.
Left censoring is the opposite. It occurs when the true event time is less than the censoring time, that is, when c > t.
Interval censoring combines left and right censoring: the event is known to have occurred between two time points, c_1 < t < c_2.
Censoring is an important issue in survival analysis and represents a particular type of missing data. Censoring that is random and non-informative is usually required in order to avoid bias in a survival analysis. The interpretation of Cox regression and Kaplan-Meier results depends on whether the event of interest is positive (e.g., a sale) or negative (e.g., product failure).
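The three censoring types can be told apart by which bounds on the event time are known. The sketch below encodes each observation as a pair of bounds. This representation and the function name are assumptions made for illustration, not a convention taken from the chapter.

```python
def censoring_type(lower, upper):
    """Classify one observation from the bounds known for its event time t.

    lower: latest time at which the event is known not to have occurred yet
           (None if no such time is known)
    upper: earliest time by which the event is known to have occurred
           (None if no such time is known)
    """
    if lower is None and upper is None:
        raise ValueError("at least one bound on the event time is required")
    if upper is None:
        return "right"  # only c < t is known
    if lower is None:
        return "left"  # only t < c is known
    if lower == upper:
        return "exact"  # event time observed, no censoring
    return "interval"  # c_1 < t < c_2


print(censoring_type(5.0, None))  # subject still event-free at the end of the study
print(censoring_type(None, 2.0))  # event had already happened at the first check
print(censoring_type(2.0, 4.0))  # event happened between two visits
print(censoring_type(3.0, 3.0))  # event time observed directly
```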

Case study for churn prediction
The following graphical illustrations depict churn prediction and model deployment using RapidMiner. The algorithm used for the analysis is a decision tree [7]. The model is evaluated with lift chart analysis and performance metrics. The attributes in the dataset include person ID, churn status, gender, age, region code, transaction count, average balance, and total accounts.
The case study explores an application widely used for churn analysis. Churn prediction is now carried out in most industries to learn from customers' histories. Customer demographic data is analyzed day to day with regard to the maintenance of business relationships, customer transactions, products purchased, and surveys about what attracts customers to the business. To explore this application, we used the RapidMiner tool for the estimation and analysis of the survival rate of customers in an organization. Figure 8 shows the selection of the churn prediction application for estimating the survival rate of customers.
RapidMiner is a statistical and analytical tool that is widely used in industry and academic institutions. It helps statisticians and mathematical experts observe the insights and patterns that lie within the given data. Figure 9 shows the analytical results observed with RapidMiner.
All components in RapidMiner are connected to one another through wires between processes. The workflow of each process is implemented in Java. The process diagram depicts the step-by-step flow of model development, built with drag-and-drop. Figure 10 provides a complete overview of the process created for churn prediction analysis. The model is developed with the decision tree classification algorithm [8]. A decision tree has a tree-like, top-down structure with a single root node and a number of leaf nodes, and its growth stops at a terminating condition. How the algorithm works depends on the splitting criterion used for the analysis (Figures 11 and 12).
A lift chart shows the effectiveness of the predictive model that has been developed. It generally gives the ratio of the predicted values to the actual ones. In Figure 13, the chart for churn analysis shows the ratio between the confidence value and the observed count. Thus, churn prediction is used to track the survival rate of customers with survival analytics. Survival analytics models can likewise be deployed to track the survival rate of patients in the medical domain. Health informatics is centrally concerned with the survival of subjects who have a specific disease. Whether subjects with a specific disease survive or become unavailable can be learnt from the patterns found through survival analytics models.
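As a rough illustration of what a lift chart summarizes, the sketch below computes lift under one common definition: the churn rate among the highest-scored customers divided by the overall churn rate. This definition, the function name, and the sample scores are assumptions. They are not taken from the RapidMiner process shown in the figures.

```python
def lift_at(scores, labels, fraction=0.1):
    """Lift of the top `fraction` of cases when ranked by model score.

    scores: model confidence that each customer churns
    labels: 1 if the customer actually churned, 0 otherwise
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)  # assumes at least one churner
    return top_rate / base_rate


# Hypothetical confidences and outcomes for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(lift_at(scores, labels, fraction=0.2), 3))
```

A lift above 1 for the top-ranked group means the model concentrates churners there better than random selection would.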

Case study using Kaplan-Meier analytics
Consider the records of 200 patients that have been tracked over a period of time. The tracking is done such that the total number of patients is confirmed at the end of the first year, the second year, the third year, and so on. If all the subjects had survived and remained available for the whole duration, there would be no need to estimate a probability for each subject. To illustrate the more complicated situation, consider the following scenario in Table 1.
Conditions:
1. Out of 200 subjects, 6 became unavailable and 10 were found to be dead at the end of the first year.
2. Of the remaining subjects, 6 became unavailable and 20 were found to be dead at the end of the second year.
3. Of the remaining subjects, 6 became unavailable and 30 were found to be dead at the end of the third year.
4. Of the remaining subjects, 6 became unavailable and 40 were found to be dead at the end of the fourth year.
5. Of the remaining subjects, 6 became unavailable and 50 were found to be dead at the end of the fifth year.

Time period At risk Become unavailable (censored) Died Survived
Year 1

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
For this scenario, we can determine the individuals who became unavailable at the end of each time period. Use Kaplan-Meier analytics to determine the number of individuals at risk in each period and the probability estimate for individuals surviving to the end of the fifth year.
Solution:
Step 1: Kaplan and Meier suggested that subjects who become unavailable during a given time period be counted among those who survive to the end of that period, but be removed from the number of individuals at risk in the next time period. With these conventions, the formulation is described in Table 2.
From Table 2, at the end of the fifth year, 26 individuals have survived out of the 200 individuals who had the specified disease. The next step is to determine the Kaplan-Meier probability estimate for each time interval t from the conditional probabilities. Table 3 provides the probability estimates for the 5 years of risk analysis.
From Table 3, at the end of the fifth year, the cumulative survival probability estimate is approximately 0.15; that is, about 15% of individuals are estimated to survive. Hence, from the survival probability estimates, we can determine the survival rate of individuals for a given time period t.
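The case study can be recomputed from the five conditions alone. The sketch below follows the convention described in the solution: subjects who become unavailable count as survivors of the current year and leave the risk set before the next one. The function name and the row layout are assumptions. Under this convention the calculation reproduces the 26 survivors at the end of the fifth year and gives a cumulative survival estimate of about 0.158, in line with the value of roughly 0.15 reported from Table 3.

```python
def follow_up_table(initial, censored_per_year, deaths_per_year):
    """Build yearly rows: (year, at_risk, censored, died, survived, conditional, cumulative)."""
    rows = []
    at_risk = initial
    cumulative = 1.0
    pairs = zip(censored_per_year, deaths_per_year)
    for year, (censored, died) in enumerate(pairs, start=1):
        survived = at_risk - died  # censored subjects count as surviving this year
        conditional = survived / at_risk  # P(survive year | at risk at its start)
        cumulative *= conditional  # product of the conditional probabilities
        rows.append((year, at_risk, censored, died, survived, conditional, cumulative))
        at_risk = survived - censored  # censored subjects leave before next year
    return rows


rows = follow_up_table(200, [6, 6, 6, 6, 6], [10, 20, 30, 40, 50])
for year, at_risk, censored, died, survived, conditional, cumulative in rows:
    print(year, at_risk, censored, died, survived, round(conditional, 4), round(cumulative, 4))
```

The yearly conditional probabilities fall steadily as deaths rise while the risk set shrinks, so the cumulative estimate drops quickly in the later years.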