Deep Learning for Subtyping and Prediction of Diseases: Long-Short Term Memory

The long short-term memory (LSTM) network is a type of recurrent neural network (RNN). During the training of an RNN, sequential information travels through the network from the input vector to the output neurons, while the error is calculated and propagated back through the network to update the network parameters. These networks incorporate loops into the hidden layer. The loops allow information to flow multi-directionally so that the hidden state captures past information held at a given time step. Consequently, the output depends on previous predictions that are already known. However, RNNs have limited capacity to bridge more than a certain number of steps, mainly because of vanishing gradients, which cause the predictions to capture only short-term dependencies as information from earlier steps decays. As more layers containing activation functions are added to an RNN, the gradient of the loss function approaches zero. LSTM networks enable learning of long-term dependencies: the LSTM introduces a memory unit and a gate mechanism to capture the long dependencies in a sequence. LSTM networks can therefore selectively remember or forget information and are capable of learning across thousands of timesteps using structures called cell states and three gates.


Introduction
Artificial neural networks (ANNs) are computing systems that mimic and simulate the function of the human brain to analyze and process complex data in an adaptive approach. They are capable of massively parallel computation for mapping, function approximation, classification, and pattern recognition, and they require less formal statistical training. Moreover, ANNs can identify highly complex nonlinear relationships between outcome (dependent) and predictor (independent) variables using multiple training algorithms [1,2]. Generally speaking, in terms of network architecture, ANNs are classified into two classes, feedforward and recurrent, each of which may have several subclasses.
The feedforward network is a widely used ANN paradigm for classification, function approximation, mapping, and pattern recognition. Each layer is fully connected to the neurons of the adjacent layer, with no connections between neurons in the same layer.
As the name suggests, information is fed in a forward direction from the input to the output layer through one or more hidden layers. A feedforward multilayer perceptron (MLP) with one hidden layer can approximate virtually any linear or nonlinear model to any degree of accuracy, provided it has an appropriate number of hidden neurons and an appropriate amount of data. Adding more neurons to the hidden layers gives the model the flexibility to fit extremely complex nonlinear functions. The same holds in classification modeling, where such a network can approximate any nonlinear decision boundary with great accuracy [2].
Recurrent neural networks (RNNs) emerged as an effective and scalable ANN model for several learning problems involving sequential data. These networks incorporate loops into the hidden layer. The loops allow information to flow multi-directionally so that the hidden state captures past information held at a given time step. Thus, these networks have an infinite dynamic response to sequential data. Many applications, such as Apple's Siri and Google's voice search, use RNNs.
The most popular way to train a neural network is by backpropagation (BP). This method can be used with either feedforward or recurrent networks. BP involves working backward through each timestep to calculate prediction errors and estimate a gradient, which in turn is used to update the weights in the network. To handle the long sequences found in RNNs, multiple timesteps are processed by unrolling the network, adding new layers, and recalculating the prediction error, resulting in a very deep network.
However, standard and deep RNNs may suffer from vanishing or exploding gradient problems. As more layers containing activation functions are added to an RNN, the gradient of the loss function may approach zero (vanish), leaving the weights essentially unchanged. This stops further training and ends the procedure prematurely. As a result, the parameters capture only short-term dependencies, while information from earlier time steps may be forgotten, and the model converges on a poor solution. Because error gradients can be unstable, the reverse issue, exploding gradients, may also occur, causing the errors to grow drastically within each time step (MATLAB, 2020b). Therefore, backpropagation may be limited in the number of timesteps it can handle.
Long short-term memory (LSTM) networks address the issue of vanishing/exploding gradients and were first introduced by [3]. In addition to the hidden state of an RNN, an LSTM block includes memory cells (which store previous information) and introduces a series of gates, called the input, output, and forget gates. These gates allow for additional adjustments (to account for nonlinearity) and prevent errors from vanishing or exploding. The result is a more accurate predicted outcome: the solution does not stop prematurely, nor is previous information lost [4].
Practical applications of LSTM have been published in medical journals. For example, studies [5][6][7][8][9][10][11][12][13][14] used different variants of RNN for classification and prediction from medical records. Importantly, LSTM makes no assumptions about elapsed-time measures, so it can be utilized to subtype patients or diseases. In one such study, LSTM was used to predict the risk of disease progression for patients with Parkinson's disease by leveraging longitudinal medical records with irregular time intervals [15].
The purpose of this chapter is to introduce long short-term memory (LSTM) networks. The chapter begins with an introduction to multilayer feedforward architectures. The core characteristics of an RNN and the vanishing gradient problem are then explained briefly. Next, the LSTM neural network, the optimization of network parameters, and the methodology for avoiding the vanishing gradient problem are covered, respectively. The chapter ends with a MATLAB example for LSTM.

Artificial neural networks and multilayer neural network
Artificial Neural Networks (ANNs) are powerful computing techniques that mimic functions of the human brain to solve complex problems arising from big and messy data. As a machine learning method, ANNs can act as universal approximators of complex functions capable of capturing nonlinear relationships between inputs and outcomes. They adaptively learn functional structures by simultaneously utilizing a series of nonlinear and linear activation functions. ANNs offer several advantages. They require less formal statistical training, have the ability to detect all possible interactions between input variables, and include training algorithms adapted from backpropagation algorithms to improve the predictive ability of the model [2].
The feedforward multilayer perceptron (Figure 1) is the most used ANN architecture. Uses include function approximation, classification, and pattern recognition. Similar to information processing within the human brain, connections are formed between successive layers. The layers are fully connected because the neurons in each layer are connected to the neurons of the previous and subsequent layers through adaptable synaptic network parameters. The first (left-most) layer of a multilayer ANN is called the input layer and accepts the training data from sources external to the network. The last (right-most) layer is called the output layer and contains the output units of the network. Depending on whether the task is prediction or classification, the output layer may consist of one or more neurons. The layer(s) between the input and output layers are called hidden layer(s); depending on the architecture, multiple hidden layers can be placed between them. Training occurs at the neuronal level in the hidden and output layers by updating synaptic strengths, eliminating some synapses, and building new ones. The central idea is to distribute the error function across the hidden layers, corresponding to their effect on the output. Figure 1 demonstrates the architecture (adapted from Okut, 2016): an artificial neural network with 4 inputs (p_i), each connected to 3 hidden neurons via coefficients w^(l)_kj (l denotes the layer, j the neuron, and k the input variable). Each hidden and output neuron has a bias parameter b^(l)_j. Here P = inputs, IW = weights from the input to the hidden layer (12 weights), LW = weights from the hidden to the output layer (3 weights), b1 = hidden-layer biases (3 biases), b2 = output-layer bias (1 bias), n1 = IW·P + b1 is the weighted summation of the first layer, a1 = f(n1) is the output of the hidden layer, n2 = LW·a1 + b2 is the weighted summation of the second layer, and t̂ = a2 = f(n2) is the predicted value of the network.
The total number of parameters for this simple feedforward MLP is therefore 19 (12 + 3 weights plus 3 + 1 biases). P identifies the input layer, which is followed by one or more hidden layers and then by the output layer containing the fitted values. A feedforward MLP network is evaluated in two stages. First, in the feedforward stage, information comes from the left and each unit evaluates its activation function f; the results (outputs) are transmitted to the units connected to the right. The second stage is the backpropagation (BP) step, used for training the neural network with a gradient descent algorithm in which the network parameters are moved along the negative of the gradient of the performance function. The process consists of running the whole network backward and adjusting the weights (and error) in the hidden layer. The feedforward and backward steps are repeated many times; each repetition is called an epoch. The algorithm stops when the value of the loss (error) function has become sufficiently small.
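The feedforward stage for the Figure 1 dimensions can be sketched in a few lines. This is a minimal NumPy illustration (the chapter's examples use MATLAB; Python is used here only for clarity, and all weight values are arbitrary), with names matching the chapter's IW/LW/b1/b2 notation:

```python
import numpy as np

# Sketch of the Figure 1 network: 4 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(0)
P = rng.normal(size=(4, 1))        # input vector (4 inputs)
IW = rng.normal(size=(3, 4))       # input-to-hidden weights (12 weights)
LW = rng.normal(size=(1, 3))       # hidden-to-output weights (3 weights)
b1 = rng.normal(size=(3, 1))       # hidden-layer biases (3 biases)
b2 = rng.normal(size=(1, 1))       # output-layer bias (1 bias)

n1 = IW @ P + b1                   # weighted summation of the first layer
a1 = np.tanh(n1)                   # hidden-layer output, f = tanh
n2 = LW @ a1 + b2                  # weighted summation of the second layer
a2 = n2                            # linear output unit: the predicted value

n_params = IW.size + LW.size + b1.size + b2.size   # 19 trainable parameters
```

Counting the entries of the four arrays reproduces the parameter total directly from the architecture.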

Recurrent neural networks
Because the two stages outlined in Figure 1 above are used in all neural networks, we can extend MLP feedforward neural networks to sequential or time-series data. To do so, we must address time sequences. Recurrent neural networks (RNNs) address this issue by conducting multiple time steps that unroll the network, add new layers, and recalculate the prediction error, resulting in a very deep network. First, the connections between nodes in the hidden layer(s) form a directed graph along a temporal sequence, allowing information to persist. Through this mechanism, the concept of time creates the RNN's memory. Here, the architecture receives information from multiple previous layers of the network. Figure 2 outlines the hidden layer of an RNN, which is a nonlinear function of the previous hidden state and the current input (p). The hyperbolic tangent activation function is used to generate the hidden state. The model has memory since the hidden state is based on the "past"; as a consequence, the outputs from the previous step are fed as input to the current step. Another way to think about RNNs is that a recurrent neural network consists of multiple copies of the same network, each passing a message to a successor (Figure 3). Thus, the output value of the last time point is transmitted back to the neural network, so that the parameter estimation (weight calculation) at each time point is related to the content of the previous time point.
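The recurrence just described amounts to one line of arithmetic per time step. A minimal NumPy sketch (dimensions and weight values are illustrative, not from the chapter):

```python
import numpy as np

# One recurrent update: the new hidden state mixes the previous hidden
# state with the current input through a tanh nonlinearity.
rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
W_p = rng.normal(size=(n_hid, n_in))   # input weights
W_h = rng.normal(size=(n_hid, n_hid))  # recurrent (hidden-state) weights
b_h = np.zeros((n_hid, 1))             # hidden-state bias

h_prev = np.zeros((n_hid, 1))          # hidden state at time t-1
p_t = rng.normal(size=(n_in, 1))       # input at time t

h_t = np.tanh(W_h @ h_prev + W_p @ p_t + b_h)   # hidden state at time t
```

The same `W_h`, `W_p`, and `b_h` are reused at every time step, which is what makes the copies in Figure 3 "the same network".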

Training recurrent neural networks
Similar to feedforward MLP networks, RNNs have two stages, a forward and a backward stage. Each works together during the training of the network. However, structures and calculation patterns differ. Let us first consider the forward pass.
The forward pass can be summarized in 5 steps: 1. Summation step. In this step, two different sources of information are combined before the nonlinear activation function is applied. The weights of the previous hidden state and of the current input are held in trainable weight matrices. Multiplying the previous hidden-state vector by the hidden-state weights (W^(h) h_{t−1}) and the current input vector by the input weights (W^(p) p_t) produces the parameterized state vector and input vector, which are summed together with a bias: n_t = W^(h) h_{t−1} + W^(p) p_t + b^(h).
2. The hyperbolic tangent activation function is applied to the sum of the two to push the output between −1 and 1 (Figure 2): h_t = tanh(W^(h) h_{t−1} + W^(p) p_t + b^(h)).
3. The network input to the output unit at time t is obtained by multiplying the output weights with the updated (current) hidden state, W^(y) h_t. Therefore, the value before the softmax (or other output) activation function is applied is z_t = W^(y) h_t + b^(y), where W^(y) and b^(y) are the weight matrix and bias of the output layer.
4. The output of the network at time t is calculated. The activation function applied to the output layer depends on the type of target (dependent) variable and on the values coming from the hidden units. A second activation function (most often the sigmoid) is applied to the value generated by the output node, so the predicted value of an RNN block with a sigmoid output is ŷ_t = σ(W^(y) h_t + b^(y)). During the forward pass, the network outputs a predicted value (ŷ_i, i = t−1, t, t+1, …, t+s) at each time step. We can imagine unfolding (unrolling) the RNN as in Figure 3: at each time step, the RNN can be viewed as one of multiple copies of the same network, one copy per element of the sequence. For example, if the sequence is a four-word sentence such as "I have kidney problem", the RNN would be unrolled into a 4-layer neural network, one layer for each word. The output given in (3) is used to train the network by gradient descent after calculation of the error in (4). 5. The error (loss function, cost function) at each time step, E_t(y_t, ŷ_t), is then calculated to start the "backward pass". Here y_t and ŷ_t are the actual and predicted outcomes, respectively. After the error at each time step is calculated, it is injected backwards into the network to update the network weights at each epoch (iteration). As there are many training algorithms based on modifications of standard backpropagation, the chosen error measure can differ depending on the selected algorithm. For example, the error E_t(y) = E_t(y_t, ŷ_t) given in (4) has an additional term, E_t(w), in the Bayesian regularized neural network (BRANN) training algorithm that penalizes large weights in anticipation of achieving a smoother mapping. E_t(y) and E_t(w) have coefficients β and α, respectively (also referred to as regularization parameters or hyper-parameters), which need to be estimated adaptively [1,16].
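The five steps above can be sketched as a single loop over the sequence. This is an illustrative NumPy sketch, not the chapter's MATLAB implementation; dimensions, weights, and the all-ones targets are arbitrary:

```python
import numpy as np

def rnn_forward(seq, W_p, W_h, W_y, b_h, b_y):
    """Unrolled forward pass: steps 1-4 at every time step (sigmoid output)."""
    h = np.zeros((W_h.shape[0], 1))
    preds = []
    for p_t in seq:
        h = np.tanh(W_h @ h + W_p @ p_t + b_h)   # steps 1-2: summation, tanh
        z = W_y @ h + b_y                         # step 3: output pre-activation
        preds.append(1.0 / (1.0 + np.exp(-z)))    # step 4: sigmoid prediction
    return preds

rng = np.random.default_rng(2)
n_in, n_hid = 2, 3
W_p = rng.normal(size=(n_hid, n_in))
W_h = rng.normal(size=(n_hid, n_hid))
W_y = rng.normal(size=(1, n_hid))
b_h = np.zeros((n_hid, 1))
b_y = np.zeros((1, 1))

seq = [rng.normal(size=(n_in, 1)) for _ in range(4)]   # e.g. a 4-step sequence
preds = rnn_forward(seq, W_p, W_h, W_y, b_h, b_y)

# Step 5: cross-entropy error per step, summed over the sequence
# (targets of 1 are used purely for illustration).
E = sum(float(-np.log(y_hat)) for y_hat in preds)
```

One prediction is produced per time step, mirroring the unrolled four-layer view of a four-word sentence.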

Stage 2: Backward Pass.
After the forward pass of the RNN, the calculated error (loss function, cost function) at each time step is injected backwards into the network to update the network weights at each iteration. The idea of RNN unfolding in Figure 3 plays the central role in the way RNNs implement the backward pass. Like standard backpropagation in a feedforward MLP, the backward pass consists of a repeated application of the chain rule. For this reason, the backpropagation algorithm used for an RNN to update the network parameters is called backpropagation through time (BPTT). In BPTT, the RNN is unfolded in time to construct a feedforward MLP neural network. Then the generalized delta rule is applied to update the weights W^(p), W^(h), and W^(y) and the biases b^(h) and b^(y). Remember, the goal of backpropagation is to minimize the gradients of the error with respect to the network parameter space (W^(p), W^(h), W^(y), b^(h), and b^(y)) and then to update the parameters using stochastic gradient descent. The following equation is used to update the parameters for minimizing the error function: W := W − α ∂E/∂W, where α is the learning rate and ∂E/∂W is the derivative of the error function with respect to the parameter space. The same rule is applied to all weights and biases in the network. The error at each time step is E_t(y_t, ŷ_t), and the total error is the sum over all time steps: E = Σ_t E_t. The gradient ∂E/∂W at each time step is calculated and summed in the same way: ∂E/∂W = Σ_t ∂E_t/∂W (the same rule applies to all parameters in the network). To calculate the error gradient given in Eq. (6), the chain rule of differentiation given in (7) is used: ∂E_t/∂W = (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(∂h_t/∂h_j)(∂h_j/∂W).
The network weights can then be updated as above. Note that, as given in (2), the current state depends on the previous state (h_{t−1}) and on the other parameters. Therefore, the differentiation of h_t with respect to h_j (here j = 0, 1, …, t−1) in (7) is the derivative of a hidden state that stores memory at time t.
The Jacobian at any single step is ∂h_j/∂h_{j−1}, and over the entire interval it is the product ∂h_t/∂h_j = ∏_{i=j+1}^{t} ∂h_i/∂h_{i−1}, where the Jacobian matrix of the hidden state is ∂h_i/∂h_{i−1} = (W^(h))^T diag[tanh′(n_{i})]. Putting Eqs. (7) and (8) together gives the full backpropagated gradient. In other words, because the network parameters are used in every step up to the output, we must backpropagate gradients from the last time step (t = t) through the network all the way back to t = 0. The Jacobian product in (10) admits an eigendecomposition in which the eigenvalues and eigenvectors are generated; here (W^(h))^T is the transpose of the network parameter matrix. Consequently, if the largest eigenvalue is greater than 1, the RNN suffers from exploding gradients, and if it is smaller than 1, from vanishing gradients (see Figure 3).
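A scalar version of this product makes the geometric behavior concrete. The sketch below (all numbers are illustrative; the fixed point h = 0.5 is arbitrary) multiplies the per-step factor w·tanh′(h) fifty times, the one-dimensional analogue of the Jacobian product in (10):

```python
import numpy as np

# Scalar sketch of the Jacobian product: through t steps the gradient is
# multiplied by w * tanh'(h) once per step, so it shrinks or grows
# geometrically depending on whether |w * tanh'(h)| is below or above 1.
def backprop_factor(w, steps, h=0.5):
    d = 1.0 - np.tanh(h) ** 2        # tanh'(h), held fixed for illustration
    return (w * d) ** steps

vanish = backprop_factor(w=0.9, steps=50)   # |w * tanh'| < 1: gradient vanishes
explode = backprop_factor(w=2.0, steps=50)  # |w * tanh'| > 1: gradient explodes
```

After only 50 steps the two factors differ by many orders of magnitude, which is exactly why plain RNNs struggle with long sequences.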

Long short-term memory
As mentioned before, the output of an RNN depends on its previous state, or on the circumstances of the previous N time steps. Conventional RNNs face difficulty in learning and maintaining such long-range dependencies. Imagine the unfolding RNN of Figure 3: each time step requires a new copy of the network, and with large RNNs thousands, even millions, of weights need to be updated. In other words, ∂h_j/∂h_{j−1} is itself a chain rule; for Figure 3, for example, the derivative ∂h_4/∂h_0 expands into a product of four such Jacobians. Imagine unrolling the RNN a thousand times, in which every activation of the neurons inside the network is replicated thousands of times; especially for larger networks, this means thousands or millions of weights. As the Jacobian matrix plays a role in updating the weights, its values will range between −1 and 1 if the tanh activation function is applied to (2). It is easy to see that the derivatives of the tanh (or sigmoid) activation function eventually approach 0, and zero gradients drive the gradients in previous layers towards 0. Thus, with small values in the Jacobian matrix and multiple matrix multiplications (t − j of them, in particular), the gradient values shrink exponentially fast, eventually vanishing completely after a few time steps. As a result, the RNN ends up not learning long-range dependencies. As in RNNs, the vanishing-gradient problem is also an important issue for deep feedforward MLPs when multiple hidden layers (each with multiple neurons) are placed between the input and output layers. Long short-term memory networks (LSTMs) are a special type of RNN that can overcome the vanishing gradient problem and learn long-term dependencies. The LSTM introduces a memory unit and a gate mechanism to capture the long dependencies in a sequence. The term "long short-term memory" originates from the following intuition: simple RNN networks have long-term memory in the form of weights.
The weights change gradually during the training of the network, encoding general knowledge about the training data. They also have short-term memory in the form of ephemeral activations, which flow from each node to successive nodes [17,18].

The architecture of LSTM
The neural network architecture for an LSTM block given in Figure 4 demonstrates that the LSTM network extends the RNN's memory and can selectively remember or forget information through structures called cell states and three gates. Thus, in addition to the hidden state of an RNN, an LSTM block typically has four more layers. These layers are the cell state (C_t), an input gate (i_t), an output gate (o_t), and a forget gate (f_t). These layers interact with one another in a very specific way to generate information from the training data. A block diagram of an LSTM at any timestamp is depicted in Figure 4. This block is a recurrently connected subnet that contains the same cell-state and three-gate structure. Here p_t, h_{t−1}, and C_{t−1} correspond to the input of the current time step, the hidden output from the previous LSTM unit, and the cell state (memory) of the previous unit, respectively. The information from the previous LSTM unit is combined with the current input to generate a newly predicted value. The LSTM block is mainly divided into three gates: forget (blue), input-update (green), and output (red). Each of these gates is connected to the cell state to provide the necessary information that flows from the current time step to the next.

Cell state
As shown in the upper part of Figures 4 and 5, the cell state is the key to LSTMs and represents the memory of the LSTM network. The process for the cell state is much like a conveyor belt or production chain: information about the parameters runs straight down the entire chain, with only some linear interactions, such as multiplication and addition. The state of the information depends on these interactions; if there are no interactions, the information runs along unchanged. The LSTM block removes or adds information to the cell state through the gates, which allow optional information to cross [19].

Forget gate
The forget gate (f_t) decides what information should be thrown away or kept from the cell state. This is implemented with a sigmoid activation function, which outputs values between 0 and 1 based on the weighted current input (W^(f) p_t), the previous hidden state (h_{t−1}), and a bias (b^(f)). The forget gate (Figure 6) is described by the equation given in (11): f_t = σ(W^(f)[h_{t−1}, p_t] + b^(f)). Here, σ is the sigmoid activation function, and W^(f) and b^(f) are the weight matrix and bias vector, which are learned from the training data.
The function takes the old output (h_{t−1}) at time t−1 and the current input (p_t) at time t to calculate the components that control the cell state and hidden state of the layer. The results lie in [0, 1], where 1 represents "completely keep this" and 0 represents "completely throw this away" (Figure 6).

Input gate
The input gate (i_t) controls what new information will be added to the cell state from the current input. This gate also protects the memory contents from perturbation by irrelevant inputs (Figures 7 and 8). A sigmoid activation function is used to generate the input values, converting the information to values between 0 and 1. Mathematically, the input gate is i_t = σ(W^(i)[h_{t−1}, p_t] + b^(i)), where W^(i) and b^(i) are the weight matrix and bias vector, and p_t is the input at the current time step combined with the previous hidden state h_{t−1}. As with the forget gate, the parameters of the input gate are learned from the training data. At each time step, with the new information p_t, we can compute a candidate cell state.
Next, a vector of new candidate values, C̃_t, is created. The computation of the new candidate is similar to that of (11) and (12) but uses the hyperbolic tangent activation function, with a value range of (−1, 1). This leads to the following Eq. (13) at time t: C̃_t = tanh(W^(c)[h_{t−1}, p_t] + b^(c)).
In the next step, the values of the input gate and cell candidate are combined to create and update the cell state, as given in (14). A linear combination of the input gate and forget gate is used to update the previous cell state (C_{t−1}) into the current cell state (C_t). Once again, the input gate (i_t) governs how much new data is taken into account via the candidate (C̃_t), while the forget gate (f_t) determines how much of the old memory cell content (C_{t−1}) is retained. Using pointwise multiplication (⊙, the Hadamard product), we arrive at the following update equation: C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t.

Output gate
The output gate (o_t) controls which information is revealed from the updated cell state (C_t) to the output at a single time step. In other words, the output gate determines what the value of the next hidden state should be at each time step. As depicted in Figure 9, the hidden state comprises information on previous inputs. Moreover, the calculated value of the hidden state for the given time step is used for the prediction (ŷ_t = softmax(W^(y) h_t + b^(y))). Here, softmax stands for a nonlinear output activation function (sigmoid, hyperbolic tangent, etc.).
First, the previous hidden state (h_{t−1}) and the current input are passed into a sigmoid function. Next, the newly updated cell state is passed through the tanh function [15,18]. Finally, the tanh output is multiplied by the sigmoid output to determine what information the hidden state should carry (16): h_t = o_t ⊙ tanh(C_t). The final product of the output gate is the updated hidden state, which is used for the prediction at time step t. The aim of this gate is therefore to separate the updated cell state (the updated memory) from the hidden state. The updated cell state (C_t) contains a lot of information that does not necessarily need to be saved in the updated hidden state. However, this information is critical because the updated hidden state at each time step is used in all gates of an LSTM block. Thus, the output gate assesses what parts of the cell state (C_t) are presented in the hidden state (h_t). The new cell and hidden states are then passed to the next time step (Figure 9).

Figure 9.
The output gate decides what information will be output using sigmoid (σ) and tanh (to push the values to be between −1 and 1) layers.
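The gate equations above assemble into a single cell update. The sketch below is an illustrative NumPy implementation of one LSTM step (the chapter's examples use MATLAB; the dictionary-of-weights layout, the concatenated [h_{t−1}; p_t] input, and all dimensions are choices made here for brevity, not the chapter's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(p_t, h_prev, c_prev, W, b):
    """One LSTM step: forget, input, candidate, cell update, output, hidden."""
    x = np.vstack([h_prev, p_t])            # concatenated [h_{t-1}; p_t]
    f_t = sigmoid(W["f"] @ x + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x + b["i"])      # input gate
    c_hat = np.tanh(W["c"] @ x + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat        # cell-state update
    o_t = sigmoid(W["o"] @ x + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(3)
n_in, n_hid = 2, 3
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros((n_hid, 1)) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=(n_in, 1)),
                     np.zeros((n_hid, 1)), np.zeros((n_hid, 1)), W, b)
```

Note that `h_t` and `c_t` are both passed to the next time step, exactly as Figure 9 shows.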


Backward pass
Like an RNN, an LSTM network generates an output (ŷ_t) at each time step that is used to train the network via gradient descent (Figure 10). During the backward pass, the network parameters are updated at each epoch (iteration). The only fundamental difference between the backpropagation algorithms of the RNN and the LSTM is a minor modification. Here, the calculated error term at each time step is E_t = −y_t log ŷ_t. As in the RNN, the total error is the sum of the errors from all time steps, E = Σ_t −y_t log ŷ_t. Similarly, the gradient ∂E/∂W at each time step is calculated and then the gradients are summed over the time steps. Remember, the predicted value ŷ_t is a function of the hidden state (ŷ_t = σ(W^(y) h_t + b^(y))), and the hidden state h_t is a function of the cell state (h_t = o_t ⊙ tanh(C_t)); both are subject to the chain rule. Hence, the derivative of an individual error term with respect to the network parameters, and the overall error gradient, follow from the chain rule of differentiation: ∂E_t/∂W = (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(∂h_t/∂C_t)(∂C_t/∂C_j)(∂C_j/∂W). As Eq. (19) illustrates, the gradient in LSTM training with the backpropagation algorithm involves a chain rule through the cell state, ∂C_t, while the gradient equation for a basic RNN involves a chain rule through the hidden state, ∂h_t. Therefore, it is the Jacobian matrix of the cell state that matters for an LSTM [20].

The problem of gradient vanishing. Recall Eq. (14) for the cell state: C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t. When we consider Eq. (19), the value of the gradients in the LSTM is controlled by the chain of derivatives starting from the term ∂C_t/∂C_{t−1}.
Expanding this term using the expression for C_t shows that ∂C_t/∂C_{t−1} does not have a fixed pattern and can take on any positive value in the LSTM, whereas the corresponding term in the standard RNN ends up persistently greater than 1 or less than 1 after a certain number of time steps. Thus, for an LSTM, the term will not converge to 0 or diverge completely, even for an infinite number of time steps. If the gradient starts converging towards zero, the weights of the gates can be adjusted to bring it closer to 1.
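The additive cell-state update is what keeps this derivative well behaved. A scalar finite-difference sketch (all gate values are illustrative constants, not learned quantities) shows that, with the gates held fixed, the cell-state derivative is simply the forget-gate value, which the network can keep near 1:

```python
# With the gates held fixed, c_t = f_t * c_prev + i_t * c_hat, so the
# derivative dc_t/dc_prev is just f_t -- a value the gates can keep close
# to 1, unlike the repeated w * tanh'(.) factor in a plain RNN.
f_t, i_t, c_hat = 0.95, 0.4, 0.2      # illustrative constants

def cell(c_prev):
    return f_t * c_prev + i_t * c_hat

eps = 1e-6
grad = (cell(1.0 + eps) - cell(1.0)) / eps   # finite-difference dc_t/dc_prev

long_range = f_t ** 50   # gradient factor surviving 50 steps back in time
```

With f_t near 1, the 50-step factor stays far from zero, which is the mechanism behind the "constant error flow" of the cell state.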

Other type of LSTMs
Several modifications to the original LSTM architecture have been proposed over the years. Surprisingly, the original continues to perform well, with predictive ability similar to that of its variants over 20 years.

Peephole connections
This type of LSTM adds "peephole connections" to the standard LSTM network. The idea behind peephole connections is to capture information inherent to time lags: with peephole connections, the information conveyed by the time intervals between sub-patterns of sequences is included in the recurrent network. Peephole connections feed the previous cell state (C_{t−1}) into the forget, input, and output gates, so that each gate also receives C_{t−1} as an input alongside h_{t−1} and p_t. This configuration was proposed to improve the ability of LSTMs to count and to time distances between rare events [21].
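The change to each gate is small. A minimal sketch of a forget gate with a peephole (the diagonal peephole weights `V_f` and all dimensions are illustrative assumptions, following the common elementwise formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forget gate with a peephole: the gate "peeks" at the previous cell state
# through an elementwise weight vector V_f in addition to [h_{t-1}; p_t].
rng = np.random.default_rng(5)
n_in, n_hid = 2, 3
x = np.vstack([np.zeros((n_hid, 1)), rng.normal(size=(n_in, 1))])
c_prev = rng.normal(size=(n_hid, 1))
W_f = rng.normal(size=(n_hid, n_hid + n_in))
V_f = rng.normal(size=(n_hid, 1))     # peephole weights into the forget gate
b_f = np.zeros((n_hid, 1))

f_t = sigmoid(W_f @ x + V_f * c_prev + b_f)   # peephole forget gate
```

The input and output gates gain analogous peephole terms; only the extra `V ⊙ C_{t−1}` summand distinguishes them from the standard gates.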

Gated recurrent units
The gated recurrent unit (GRU) is a simplified version of the standard LSTM, designed to have more persistent memory and to make RNNs easier to train. The GRU, a type of gated RNN, was introduced by [22]. In a GRU, the forget and input gates are merged into a single gate named the "update gate". Moreover, the cell state and hidden state are also merged. The GRU therefore has fewer parameters and has been shown to outperform the LSTM on some tasks requiring the capture of long-term dependencies.

Multiplicative LSTMs (mLSTMs)
This configuration of LSTM was introduced by Krause et al. [23]. The architecture, designed for sequence modeling, combines the LSTM and multiplicative RNN architectures. The mLSTM is characterized by its ability to have a different recurrent transition function for each possible input, which makes it more expressive for autoregressive density estimation. Krause et al. concluded that the mLSTM outperforms the standard LSTM and its deep variants on a range of character-level language-modeling tasks.

LSTM with attention
The core idea behind LSTM with attention is to free the encoder-decoder architecture from its fixed-length internal representation, one of the most transformative innovations in sequence modeling. LSTM with attention was used by Wu et al. [24] in Google's Neural Machine Translation system for bridging the gap between human and machine translation. It consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. Most likely, this type of LSTM continues to power Google Translate to this day.

Examples
Two examples using MATLAB for LSTM are given in this chapter.
Example 1. This example shows how to forecast time-series data for COVID-19 in the USA using a long short-term memory (LSTM) network. The variable used in the training data is the rate of positive tests (number of positives/number of tests) for each day between 01/22/2020 and 12/22/2020. The data set was taken from the publicly available website https://covidtracking.com/data/national, where the data are updated each day between about 6:00 pm and 7:30 pm Eastern Time. The initiative relies upon publicly available data from multiple sources. States in the USA are not consistent in how and when they release and update their data, and some may even retroactively change the numbers they report; this can affect the predictions presented in these data visualizations (Figure 11a-d). The steps for Example 1 are summarized in Table 1 (MATLAB 2020b) and the results are illustrated in Figure 11a-d. The LSTM network was trained on the first 90% of the sequence and tested on the last 10%; the results therefore reflect predicting the positive rate for the last 38 days.
This example trains an LSTM network to forecast the number of positive tests given the number of cases on previous days. The training data contain a single time series, with time steps corresponding to days and values corresponding to the number of cases. To make predictions on a new sequence, reset the network state using the "resetState" command in MATLAB. Resetting the network state prevents previous predictions from affecting the predictions on the new data. Reset the network state, and then initialize it by predicting on the training data (MATLAB, 2020b). The solid red line in Figure 11a and c indicates the number of cases predicted for the last 30 days. Example 2: The data for Example 2 come from SEER 2017 for different age groups. SEER collects cancer incidence data from population-based cancer registries of the U.S. population. The SEER registries collect data on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, and the first course of treatment, and they follow up with patients for vital status. The example given in this chapter concerns the cancer types and non-cancer causes of death identified from the survey (https://seer.cancer.gov/data/). The steps for Example 2 are summarized in Table 2, and the distribution across age groups is illustrated in Figure 12. Here, in order to input the documents into an LSTM network, "wordEncoding(documentsTrain)" is used. This code converts the names of diseases into sequences of numeric indices. Disease names (and all other text structures) are prepared for LSTM in MATLAB in three consecutive steps: 1) tokenize the text, 2) convert the text to lowercase, and 3) erase the punctuation. The function stops predicting when the network predicts the end-of-text character or when the generated text is 500 characters long.

Conclusions and future work
LSTM is a very powerful ANN architecture for disease subtyping, time-series analysis, text generation, handwriting recognition, music generation, language translation, and image captioning. The LSTM approach is effective for making predictions because equal attention is given to all input sequences through the information flowing along the cell state. Because of this mechanism, a small change in the input sequence does not harm the prediction accuracy of the LSTM. Future work on LSTM has several directions. Most LSTM architectures are designed to handle data with evenly distributed elapsed times (days, months, years, etc.) between the consecutive elements of a sequence. More studies are needed to improve the predictive ability of LSTM for non-constant elapsed times between consecutive observations. Moreover, further studies are needed on possible overfitting problems when training with smaller data sets. Rather than using early stopping to avoid overfitting, a Bayesian regularized approach would be more effective to ensure that the neural network halts training at the point where further training would result in overfitting. As the Bayesian regularized approach uses a different loss function with additional hyperparameters, it demands costly computational resources.