Parameters of VAE model.

## Abstract

Subtle distortions in an electrocardiogram (ECG) can help doctors diagnose some serious latent heart diseases in their patients. However, these distortions are difficult to find manually because of disturbing factors such as baseline wander and high-frequency noise. In this chapter, we propose a method based on the variational autoencoder to distinguish these distortions automatically and efficiently. We test our method on three ECG datasets from PhysioNet by adding tiny artificial distortions. Compared with other autoencoder-based approaches [e.g., contractive autoencoder (CAE), denoising autoencoder (DAE)], the experimental results show that our method improves the recognition of these distortions in publicly available ECG data.

### Keywords

- electrocardiogram
- variational autoencoder
- variational inference
- ECG enhancement
- deep learning

## 1. Introduction

Automatic electrocardiogram (ECG) recognition [29] greatly helps doctors in the diagnosis and treatment of heart disease. As the number of portable ECG devices increases, more and more ECG records become available. However, these ECG data are inevitably contaminated by different kinds of noise caused by interference such as baseline wandering, muscle shaking, and electrode movement [13, 14]. Given the level and complexity of this noise, especially those components that may cause subtle deformations of ECG waveforms, the accuracy of ECG recognition may decrease. Additionally, far more unlabeled ECG data (i.e., data without any type information) are stored in many databases. Therefore, it is necessary to improve the performance of automatic ECG classification in the unsupervised context by choosing proper models and algorithms.

To prevent noisy inference, many approaches to ECG preprocessing or enhancement have been successfully employed to remove the contamination. Traditionally, most of these approaches are based on filtering in the frequency domain. Ziarani et al. and Konrad [15] eliminated power line noise by extracting a specified component of a signal and tracking its variations over time. Alfaouri et al. [16] and Dewangan et al. [17] employed the wavelet transform to isolate baseline wander and to detect and suppress power line interference in ECG. Although these filters help suppress high-frequency interference, they may simultaneously discard useful information about heart illness, because the frequency spectrum of that information spreads over not only the low band but also the high band. To overcome these drawbacks of filtering-based methods, some adaptive methods have been proposed. Abdelmounim et al. [18] applied an adaptive algorithm that removes noise by subsequently adapting to the wavelets selected by proper thresholding. However, the authors also reported that this method could not remove baseline wandering smoothly and effectively. Additionally, other technologies such as the Fourier transform (FT) and empirical mode decomposition (EMD) have also been employed for ECG preprocessing [19, 20]. FT maps the higher frequency components into the low area. Similarly, EMD separates different ECG components by proper intrinsic mode functions.

Feature extraction is another important procedure in ECG recognition. ECG features consist of amplitudes, intervals, and segments, which are shown in Figure 1. Each feature indicates certain activities of the heart. For example, the P wave represents atrial depolarization, which causes both atria to contract and pump blood into the ventricles. Any distortion of the P wave indicates that an atrial malfunction has appeared.

Traditionally, the goal of ECG feature extraction is to extract all of the abovementioned features. As the amplitude of the R wave is much larger than that of any other wave, many approaches based on QRS complex detection have been proposed. Chan et al. [21] used a specific template to match the preferred ECG signals by computing the correlation between them. Krasteva and Jekova et al. [22] successfully implemented this method to evaluate the heart rhythm. Nevertheless, these approaches depend heavily on prior knowledge about ECG and the relevant areas [23, 25], which makes further applications more difficult. Comparatively, some other approaches based on kernel functions are more popular and widely used because of their simplicity and sensitivity. Martis et al. [3] studied several methods [principal component analysis (PCA), linear discriminant analysis (LDA), independent component analysis (ICA), and discrete wavelet transform (DWT)] and compared them in feature extraction for classifying arrhythmia ECGs. Banerjee et al. [5] focused on two specific regions (the QRS complex area and the T-wave region) of ECG waveforms to distinguish between normal and abnormal ECG signals by means of the wavelet cross spectrum and wavelet coherence. Kærgaard et al. [6] proposed two hybrid signal processing schemes [ensemble empirical mode decomposition (EEMD) and discrete wavelet transform (DWT)] for ECG feature extraction. These schemes were implemented in combination with a neural network and the wavelet transform. Nazarahari et al. [8] chose wavelet functions (WFs) as a means of ECG classification and proposed a wavelet design criterion for choosing the wavelet function. Houssein et al. [4] classified ECG with modified water wave optimization (WWO) algorithms and achieved over 93% average accuracy.

Although conventional methods based on kernel technologies have made many important contributions to ECG feature extraction, their accuracy and efficiency rarely meet all the requirements of applications, especially in the presence of noise. Fortunately, unlike the kernel methods, neural networks can extract ECG features automatically through their hierarchical structure in the context of deep learning, an approach known as representation learning. Yan et al. [12] used a restricted Boltzmann machine (RBM) for ECG classification. Xiong et al. [9, 10] employed a denoising autoencoder (DAE) and a stacked contractive denoising autoencoder for ECG denoising [8], respectively. Zhou et al. [11] chose a stacked sparse autoencoder (SAE) to extract ECG features for classification, and the accuracy achieved by this work shows demonstrable benefits over traditional methods that require the wavelet transform to perform ECG classification.

In terms of automatic diagnosis of heart illness assisted by ECG recognition, some of the works mentioned above do not meet the necessary requirements because most studies focused on distinguishing arrhythmias. Nevertheless, many heart diseases are closely related not only to the heart rhythm itself but also to other features of the ECG waveform, such as the length of the ST segment and the amplitude of the P wave. Additionally, generative models have rarely been used for ECG recognition. The contributions of this chapter are twofold: (1) instead of using ECG signals over a cardiac period between two P-wave start points, we propose a new method that extracts ECG segments between two adjacent R peaks, and (2) we use the variational autoencoder (VAE) model as an analysis tool to recognize different ECG signals by focusing on the variation of tiny distortions.

This chapter is organized as follows. Section 2 briefly describes the autoencoder and its variants. Section 3 introduces variational inference and the variational autoencoder in detail. The ECG preprocessing and classification scheme is proposed in Section 4. Our experimental results and discussion are presented in Section 5. Finally, Section 6 concludes.

## 2. Autoencoders and variants

The variational autoencoder is closely related to the autoencoder. An autoencoder is a neural network that consists of an encoder and a decoder. The encoder maps its input into a representation, and the decoder reconstructs the input from that representation. A well-trained autoencoder reproduces the training data only approximately, because it is forced to prioritize those aspects of the input that are helpful for reconstruction and to discard the others. In this way, the autoencoder learns the useful properties of the training data. The VAE shares these characteristics with the autoencoder (AE) but also has some specialties of its own.

### 2.1. Autoencoder and regularized variants

An autoencoder can be used to obtain useful features from the encoder output. In terms of the feature dimension, autoencoders fall into two categories: undercomplete and overcomplete. In an undercomplete autoencoder, the dimension of the feature is smaller than that of the input, and the most salient features can be learned well in this scenario. Conversely, in an overcomplete autoencoder, the dimension of the feature is greater than that of the input, and sparser features might be obtained in this setting. The objective function is another core topic for an autoencoder. It is designed to give the autoencoder capabilities such as linear regression or logistic regression, which limit the model to some useful properties of the training data. The general form of the objective function can be depicted as follows:

where

Different forms of the regularizer term give the autoencoder different properties and lead to different variants of the regularized autoencoder. These variants primarily include the sparse autoencoder (SAE), denoising autoencoder (DAE) [3], contractive autoencoder (CAE), and variational autoencoder (VAE). Theoretically, the VAE combines variational inference (VI) and neural networks. As a generative model, one of the prominent successes of the VAE is that it realizes effective random sampling while still being trainable with back-propagation (BP). This will be described in detail in Section 3.
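The idea of a reconstruction term plus a regularizer term can be illustrated with a minimal sketch. It is not the chapter's own formula: the function names, the weight `lam`, and the choice of an L1 penalty on the representation `h` are assumptions made only for illustration.

```python
def mse(x, x_hat):
    """Mean squared reconstruction error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def l1_penalty(h):
    """One possible regularizer (an assumption here): the L1 norm of the representation."""
    return sum(abs(v) for v in h)


def regularized_objective(x, h, x_hat, lam=0.1):
    """Reconstruction loss plus a weighted regularizer on the representation."""
    return mse(x, x_hat) + lam * l1_penalty(h)


# Toy values standing in for an input, its representation, and its reconstruction.
x = [0.0, 1.0, 0.5, -0.5]
h = [0.2, 0.0]
x_hat = [0.1, 0.9, 0.5, -0.4]
loss = regularized_objective(x, h, x_hat, lam=0.1)
```

Swapping `l1_penalty` for another function of the representation or of the encoder changes which variant of the regularized autoencoder the objective corresponds to.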

Unlike the VAE, the SAE keeps the majority of the neurons in its hidden layers inactive, since the activation functions of these neurons are saturated for most inputs. This results in sparse features, in which many of the elements are zero (or close to zero). Mathematically, the sparsity of the SAE is accomplished by the penalty term

Generally, these autoencoders share some properties. The DAE and CAE are able to learn the manifold structure of the samples, while the SAE and CAE have a similar sparsity character in their representations. Nevertheless, the implementations of these autoencoders are quite different. For example, the DAE is trained on noise-corrupted data so that it learns parameters that can reconstruct the original, noise-free samples. Comparatively, the CAE includes the norm of the Jacobian matrix of the encoder in the loss function and encourages robustness of the representation by contracting the samples during training.

## 3. Variational inference and variational autoencoder

As the central problem in inference analysis, posterior distribution computation faces two computing challenges: marginal likelihood computation and predictive distribution computation. Both are often intractable since they require computing high-dimensional integrals. Therefore, approximate inference approaches such as Gibbs sampling, which is based on the Markov chain Monte Carlo (MCMC) principle, are appealing. However, Gibbs sampling and its variants are excluded from some applications by their inefficiency, especially in high-dimensional scenarios. This awkward situation did not change until the VAE was proposed [36]. To understand the VAE, we first start from the relevant basics, including variational inference (VI), the evidence lower bound (ELBO), the mean field assumption, and Kullback–Leibler (KL) divergence.

To describe the problem mathematically, let

### 3.1. Variational inference

Theoretically, the motivation of variational inference [33, 35] is to find a tractable distribution that approximates the desired but intractable posterior distribution. To measure how close these two distributions are, the Kullback–Leibler (KL) divergence [34] is introduced. Let

Intuitively, the KL divergence is nonnegative and decreases as the two distributions become more similar, that is, the more similar the two distributions are, the smaller the KL divergence is. It equals zero when the two distributions are identical.
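These properties can be checked numerically on small discrete distributions. The sketch below uses the standard discrete definition of the KL divergence; the example distributions are arbitrary values chosen for illustration, not data from the chapter.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as lists of probabilities.

    Assumes p[i] > 0 wherever q[i] > 0; terms with q[i] == 0 contribute nothing.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


q = [0.5, 0.3, 0.2]
p_close = [0.45, 0.35, 0.2]  # similar to q
p_far = [0.1, 0.1, 0.8]      # very different from q

kl_same = kl_divergence(q, q)
kl_close = kl_divergence(q, p_close)
kl_far = kl_divergence(q, p_far)
```

Here `kl_same` is zero, and `kl_close` is smaller than `kl_far`, in line with the intuition that a closer approximation yields a smaller divergence. Note that the KL divergence is not symmetric in its two arguments.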

#### 3.1.1. Evidence lower bound

In the context of Bayesian statistics, “Evidence” is an alternative term used for the marginal likelihood of the observations. Formula (3) reveals the relationship between KL divergence and the logarithm of the evidence

Intuitively, maximizing the ELBO is equivalent to minimizing the KL divergence, because the log evidence does not depend on the approximating distribution.
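The relationship between the log evidence, the ELBO, and the KL divergence can be verified on a toy model with a discrete latent variable. The prior and likelihood values below are made up for illustration; the identity checked is the standard decomposition, log evidence = ELBO + KL(q || posterior).

```python
import math

# Toy model with three latent states z and one fixed observation x (assumed values).
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) at the observed x
joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                   # p(x)
posterior = [j / evidence for j in joint]               # p(z | x)


def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a discrete approximating distribution q."""
    return sum(qz * (math.log(jz) - math.log(qz)) for qz, jz in zip(q, joint) if qz > 0)


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


q = [0.4, 0.4, 0.2]                   # an arbitrary approximation of the posterior
gap = math.log(evidence) - elbo(q)    # equals KL(q || posterior)
```

Because `math.log(evidence)` is fixed once the observation is given, any change in `q` that raises the ELBO lowers the KL divergence by the same amount, and the bound is tight when `q` equals the posterior.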

#### 3.1.2. Mean field

To simplify the optimization of the ELBO, it is necessary to make an assumption on the family

where

Then, the ELBO can be written as Eq. (7):

where

Formula (8) indicates that the factors are all proportional to the exponentiated log the joint distribution except the

### 3.2. Variational autoencoder

As a deterministic model, a general regularized autoencoder does not know how to create a latent vector until a sample is input. Conversely, the variational autoencoder (VAE) [36] is a generative model and a successful example of combining variational inference with neural networks. The VAE forces the latent vector to follow a certain distribution. This not only preserves the properties of general regularized autoencoders but also adds some new ones. For example, the VAE can generate data points even without any encoder input, which distinguishes it from the other regularized autoencoders. To explore the VAE further, it is necessary to understand its neural network structure, loss function, and optimization algorithm.

In terms of hierarchy, the neural network structure of the VAE is mainly composed of three parts. The first part is the encoder, which encodes the signals from the input layer. The second part is the decoder, which is located on the right side of Figure 2. The third part is the sampling unit, located between the other two parts. While the encoder and the decoder are similar to those of the traditional autoencoder, the additional sampling unit is responsible for sampling from the latent variable space.

Another issue about how to train the structure is the loss function as shown in formula (9), which is essentially the same as the negative

The last issue for the VAE is how to minimize the loss function of Eq. (9) in the neural network, for which algorithms based on gradient descent are commonly adopted. The first term in Eq. (9) is comparatively easy to compute: the expectation measures the reconstruction difference, and we can calculate it as the mean squared error between the input of the encoder and the output of the decoder, as in traditional autoencoders. However, it is more difficult to compute the second term, the KL divergence, directly as

Additionally, to train a VAE, attention must be paid to gradient descent when the error back-propagates through the sampling layer. However, we cannot differentiate the loss function with respect to the distribution

It is clear that the gradient depends not only on the decoder’s distribution
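The sampling step and the KL term are commonly handled as in the following sketch. It assumes a diagonal Gaussian encoder and a standard normal prior, which is the usual VAE setup but is not confirmed by the surviving text; the function names and the use of a log-variance parameterization are likewise assumptions. The sample is written as a deterministic function of the encoder outputs plus independent noise, so gradients can flow through `mu` and `log_var`.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0]                 # two latent dimensions, as in a 2-D latent space
log_var = [0.0, math.log(0.25)]
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The closed form avoids estimating the KL term by sampling, and the randomness is confined to `eps`, which does not depend on any trainable parameter.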

## 4. ECG preprocessing and enhancement

In this section, we introduce our method for ECG preprocessing and enhancement. The task in this procedure is to split the ECG waves into segments according to the cardiac cycle [28] and then take them as data points for training our models. As noted in Section 1, the QRS complex, which reflects ventricular depolarization, has a morphologically higher amplitude and sharper peak than other components such as the P-wave and T-wave. Therefore, it is much more convenient to detect and locate Q peaks (or R and S peaks) than any other component of these ECG segments. Algorithm 1 describes in detail how to split the ECG waveforms. The templates selected in Algorithm 1 are produced from the contours of most ECG R wave peaks.

The critical step in Algorithm 1 is evaluating the similarity between the selected area of the ECG waveform and the given template. The mean squared error (MSE) is commonly adopted for this purpose in ECG recognition applications. However, the main disadvantage of this method is that the selected area must first be aligned with the given template, which is time-consuming. For example, for two pictures that contain the same curve, the MSE is tiny if the curves are aligned extremely well and very large if they do not overlap at all. Another reasonable approach, the correlation coefficient, is currently in use [21, 26]. Instead of directly computing the difference between the ECG waveform and the template as the MSE method does, it solves an optimization problem that minimizes the sum of the squared offsets between the selected ECG data points and the corresponding points on the template.
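One practical difference between the two similarity measures can be seen in a small sketch. It assumes that the correlation coefficient meant here is the Pearson coefficient, and the template and window values are made up. The window has the same shape as the template but a different baseline offset and gain, which is roughly what baseline wander and amplitude variation produce.

```python
import math


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def correlation(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


template = [0.0, 0.2, 1.0, 0.2, 0.0]        # idealized R-wave contour
window = [0.5 + 2.0 * v for v in template]  # same shape, shifted baseline and larger gain
misaligned = [1.0, 0.2, 0.0, 0.0, 0.2]      # same values, peak in the wrong place

corr_same_shape = correlation(template, window)
mse_same_shape = mse(template, window)
corr_misaligned = correlation(template, misaligned)
```

The correlation stays at 1 for the shifted and scaled window while the MSE is large, so a fixed correlation threshold can be used without first normalizing the baseline. Both measures still respond to a peak in the wrong position, as `corr_misaligned` shows.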

We introduce a parameter

**Algorithm 1.** ECG R wave peak location algorithm.

1: **input:** ECG data file name **pa**

2: **initial**: set segment length

3: read ECG data into ECG data buffer

4: calculate segment number

5: **for** each segment

6: let search range in vertical direction equal start position;

7: **while not****and** **and** **do**

8: Look for R wave peak in small area of

9: **if**

10: Save the result to

11: **break**;

12: **else**

13: Update range of

14: **end if**

15: **end while**

16: update

17: **end for**

18: **return** ECG data array
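The general idea of locating one R peak per segment by template matching can be sketched as follows. This is not a transcription of Algorithm 1, whose conditions and parameters are not fully recoverable here: the exhaustive sliding search, the `threshold` value of 0.9, the function names, and the synthetic signal are all assumptions made for illustration.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def locate_r_peaks(signal, template, segment_len, threshold=0.9):
    """Return one R-peak index per fixed-length segment whose best match passes the threshold."""
    width = len(template)
    peaks = []
    for start in range(0, len(signal) - segment_len + 1, segment_len):
        best_score, best_pos = -1.0, None
        # Slide the template over the segment and keep the best-matching window.
        for pos in range(start, start + segment_len - width + 1):
            score = correlation(signal[pos:pos + width], template)
            if score > best_score:
                best_score, best_pos = score, pos
        if best_pos is not None and best_score >= threshold:
            window = signal[best_pos:best_pos + width]
            peaks.append(best_pos + window.index(max(window)))
    return peaks


# Synthetic signal: three identical beats with a triangular "R wave" (illustrative only).
beat = [0.0] * 10 + [0.1, 0.5, 1.0, 0.5, 0.1] + [0.0] * 10
signal = beat * 3
template = [0.1, 0.5, 1.0, 0.5, 0.1]
r_peaks = locate_r_peaks(signal, template, segment_len=len(beat))
```

A peak that straddles a segment boundary would be missed by this simplified version, so real recordings would need overlapping segments or a search range that is updated from the previous peak.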

Figure 3 shows an ECG waveform (top) and the detection and location of its R wave peaks (bottom). The ECG data are taken from the American Heart Association (AHA) database on the PhysioNet website [24], which consists of 80 two-channel ECG recordings digitized at 250 Hz with 12-bit resolution over a 10-mV range. The recordings in the database are divided into eight classes according to the highest level of ventricular ectopy present.

## 5. Experimental results and discussion

In this section, we evaluate the performance of VAE and other autoencoder variants described in Section 2.

### 5.1. ECG signals for multi-classification

To demonstrate the performance of our models on ECG signals, it is necessary to extract an intact ECG signal over one cardiac period, which contains features such as the P-wave, QRS complex, and T-wave, as described in Section 4. Detecting and locating the P-wave then becomes a critical step, as every cardiac period of an ECG signal starts at the P-wave. However, the amplitude of the P-wave is smaller than that of the QRS complex, and there are many kinds of noise in ECG signals. These factors make it more difficult to extract the ECG signal of a cardiac period.

Our solution relies on the fact that it is more feasible to locate R-peaks than to locate the start position of a P-wave. Instead of focusing on the cardiac period, we split each cardiac period into two semi-cardiac periods at the R-peak and then join the adjacent parts to form a new period of the ECG signal, which consists of the second part of the previous cardiac period and the first part of the next one. Figure 4(a) shows an example of an ECG signal composed of two such adjacent semi-periods. Additionally, in terms of information, no feature is lost in this separation.
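Once the R-peak positions are known, cutting the recording into R-to-R segments is straightforward, as the following minimal sketch shows. The function name and the toy signal are assumptions for illustration; how the segments are brought to a common length for the model input is not covered here.

```python
def rr_segments(signal, r_peaks):
    """Cut a signal into segments that run from one R peak up to the next one."""
    return [signal[a:b] for a, b in zip(r_peaks, r_peaks[1:])]


# Toy signal whose sample values equal their indices, with three assumed R-peak positions.
signal = list(range(20))
r_peaks = [3, 9, 16]
segments = rr_segments(signal, r_peaks)
```

Each segment starts at an R peak and ends just before the next one, so it contains the T-wave of one beat followed by the P-wave and the rising QRS edge of the next beat. Samples before the first and after the last R peak are discarded.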

The original ECG recordings in the ECG databases contain several hours of ECG data, and it is infeasible to train our models on these original data directly. To train our models well, 30,000 ECG signals are extracted from three different ECG databases: the AHA ECG database, the APNEA ECG database [24], and the CHFDB ECG database [24]. These ECG data are divided into three groups according to their source databases, and each group has 10,000 ECG signals. On this basis, we augment the ECG data [32] by zeroing a small segment of each ECG signal, where the different positions selected for zeroing correspond to different class labels. Figure 4(b)–(d) shows three examples of our augmentation, whose labels are 3, 4, and 5, respectively. (We use the numbers 1–8 as labels for the eight classes of ECG signals in all of our experiments. We add labels to the different classes of ECG signals not for training our models but to simplify the evaluation of their accuracy during testing.)
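The augmentation by zeroing can be sketched as below. The segment width, the zeroing positions, and the mapping of the k-th position to label k are hypothetical choices for illustration; the chapter does not state the actual values.

```python
def zero_segment(signal, start, width):
    """Return a copy of the signal with samples [start, start + width) set to zero."""
    out = list(signal)
    for i in range(start, min(start + width, len(out))):
        out[i] = 0.0
    return out


def augment(signal, positions, width):
    """Return (distorted_signal, label) pairs; the k-th position gets label k (1-based)."""
    return [(zero_segment(signal, p, width), k) for k, p in enumerate(positions, start=1)]


# Flat toy signal of length 400, matching the model input size, with 8 assumed positions.
signal = [1.0] * 400
positions = [50 * k for k in range(8)]
samples = augment(signal, positions, width=20)
```

Because each label is determined by where the zeroed segment lies, the labels can later be used to score how well the model separates the distortions without being used during training.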

To evaluate the denoising properties of our models on ECG signals, different types of noise at different levels are added to the original ECG records. These include Gaussian noise, salt-and-pepper noise, and Poisson noise. Moreover, to imitate baseline wandering, sinusoidal signals of different amplitudes are superimposed on the original ECG signals. The coefficients of the sinusoidal signal are 0.01, 0.05, and 0.1 in all of our experiments. Figure 5 shows the ECG signals polluted by different noises. Figure 5(a) and (c) show the augmented ECG signals with no added noise other than that introduced during sampling. Figure 5(b) shows an ECG signal polluted by sinusoidal noise and Gaussian noise, both with a coefficient of 0.01. In Figure 5(d), by contrast, the coefficients for the sinusoidal and Gaussian noise are 0.05 and 1, respectively. The mean and variance of the Gaussian noise are 0 and 0.01, respectively.
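A possible way to add the sinusoidal and Gaussian components is sketched below. How the coefficients enter the noise model is not specified in the text, so the sketch assumes that each coefficient simply scales its noise term; the number of sinusoid cycles per signal and the function names are also assumptions.

```python
import math
import random


def add_baseline_wander(signal, coeff, cycles=1.0):
    """Superimpose a sinusoid of amplitude `coeff` with `cycles` periods over the signal."""
    n = len(signal)
    return [v + coeff * math.sin(2.0 * math.pi * cycles * i / n) for i, v in enumerate(signal)]


def add_gaussian_noise(signal, coeff, rng, mean=0.0, variance=0.01):
    """Add Gaussian noise with the given mean and variance, scaled by `coeff`."""
    std = math.sqrt(variance)
    return [v + coeff * rng.gauss(mean, std) for v in signal]


rng = random.Random(42)           # fixed seed so the example is repeatable
clean = [0.0] * 400               # flat placeholder for an ECG segment
wander = add_baseline_wander(clean, coeff=0.05)
noisy = add_gaussian_noise(wander, coeff=0.01, rng=rng)
```

Both functions return new lists, so the clean signal stays available for computing reconstruction errors against the noisy input.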

### 5.2. Recognition of ECG signals

After the ECG signals have been extracted by the methods described in Section 5.1, they are used to train the VAE model. To examine the effect of the complexity of the ECG data on our model, all ECG data are divided into two groups. The first group contains only two classes of ECG records, normal and abnormal (we call this group the BI dataset). The normal ECG records are those that contain all normal features, as shown in Figures 4 and 5. The abnormal ECG records in the BI dataset contain at least one abnormal feature, such as a prolonged PR interval, an enlarged P-wave, or an absent T-wave. The second group contains eight classes of ECG records, each produced by zeroing a small segment of the ECG data as described in Section 5.1 (we call this group the MI dataset). The parameters of the VAE model are shown in Table 1. Table 2 shows the performance of the VAE model in recognizing the ECG signals from both the BI and MI datasets. The recognition accuracy is higher than 95% for the MI records and higher than 97% for the BI records in all but one case each. In terms of data complexity, the result is reasonable because the complexity of MI is much higher than that of BI.

Parameter name | Value | Comment |
---|---|---|
Input size | 400 | Equal the length of signal |
h1 | 100 | First layer of the encoder |
h2 | 10 | Second layer of the encoder |
z-mean | 2 | Mean of the sampler |
z-variance | 2 | Variance of sampler |
Learning rate | 0.01 | |
Function | Log-sigma | Logarithmic sigma |
Optimizer | AdamOptimizer | |
Batch size | 100 | Randomly select samples from the dataset |

DB | Record | ECG no. | Sample no. (×10³) | Class no. | Precision (%) | Error (%) |
---|---|---|---|---|---|---|
ahadb | 0001 | 0 | 10 | 2 | 97.70 | 2.30 |
ahadb | 0001 | 0 | 10 | 8 | 96.31 | 3.69 |
ahadb | 0001 | 1 | 10 | 2 | 96.63 | 3.37 |
ahadb | 0001 | 1 | 10 | 8 | 93.91 | 6.09 |
ahadb | 0201 | 0 | 10 | 2 | 99.87 | 0.13 |
ahadb | 0201 | 0 | 10 | 8 | 96.58 | 3.42 |
ahadb | 0201 | 1 | 10 | 2 | 98.10 | 1.90 |
ahadb | 0201 | 1 | 10 | 8 | 98.25 | 1.75 |
APNEA | a01 | 0 | 0.7 | 2 | 98.02 | 1.98 |
APNEA | a01 | 0 | 0.7 | 8 | 97.56 | 2.44 |
APNEA | a02 | 0 | 0.8 | 2 | 99.87 | 0.13 |
APNEA | a02 | 0 | 0.8 | 8 | 95.74 | 4.26 |
CHFDB | Chf01 | 0 | 10 | 2 | 99.99 | 0.01 |
CHFDB | Chf01 | 0 | 10 | 8 | 97.65 | 2.35 |
CHFDB | Chf01 | 1 | 10 | 2 | 98.89 | 1.11 |
CHFDB | Chf01 | 1 | 10 | 8 | 96.45 | 3.55 |
CHFDB | Chf01 | 0 | 10 | 2 | 99.75 | 0.25 |
CHFDB | Chf01 | 0 | 10 | 8 | 96.78 | 3.22 |
CHFDB | Chf01 | 1 | 10 | 2 | 99.26 | 0.74 |
CHFDB | Chf01 | 1 | 10 | 8 | 97.92 | 2.08 |

The advantages of the VAE model in recognizing ECG signals can be further shown by comparison with other autoencoders such as the CAE, DAE, and SAE mentioned in Section 2. To make the comparison fair and reasonable, all of the model parameters are the same except for those of the sampler in the VAE model (the parameter values are given in Table 1). Moreover, the BI and MI ECG records from the ahadb database are used to train and test all the models. Figure 6 shows the accuracy of the models in recognizing the ECG records. Both **(a)** and **(b)** in Figure 6 take the ratio of the representation size to the input size (the rate) as the variable. Figure 6(a) uses the BI ECG records from ahadb as the data source for the models, whereas Figure 6(b) uses the MI records from the same database. It is clear that the accuracy of the VAE model is higher than that of the other models on both BI and MI ECG records, which is at least 95% on BI records and no more than 90% on MI records. Meanwhile, both figures indicate that, under the same conditions, the accuracy is best at a rate of 1. The accuracy is near 80% when the rate falls to 0.5. Similarly, the accuracy drops sharply as the rate rises above 1. Therefore, there is no need to compress (rate < 1) or stretch (rate > 1) the representation of the ECG signals.

Figure 7 demonstrates the performance of the VAE model on noisy ECG records. The method of adding noise to the ECG records is described in Section 5.1. The coefficient for the sinusoidal signal is 0.05, and the mean and variance of the Gaussian noise are 0 and 0.05, respectively. For comparison, we take four groups of ECG records (BI, noisy BI, MI, and noisy MI) as datasets for the VAE model.

The results show that the accuracy under noisy conditions is similar to that without noise on the same dataset. This means that the performance of the VAE model in ECG recognition is robust to some kinds of noise.

## 6. Conclusions

In this chapter, we develop a VAE model to recognize tiny distortions in ECG signals. First, we analyze the characteristics of the features of ECG signals, which are closely related to ECG components such as the P-wave, QRS complex, and T-wave. Second, we describe an algorithm that locates R peaks. On the basis of this algorithm, we extract segments of ECG signals between two adjacent R peaks from three real-life ECG databases. Finally, we train our models on the selected ECG signals. The results of our experiments demonstrate that the proposed VAE model can be used as an effective tool to recognize ECG signals automatically. In particular, the model is robust to some kinds of noise that are usually produced during sampling. Furthermore, the VAE is a recently established generative model based on neural networks. An important characteristic of the model is that it can be used in unsupervised learning [31]. With the emergence of large amounts of unlabeled ECG records and the demand for real-time diagnosis of heart illness through automatic ECG recognition, the method in this chapter can offer a solution to these problems.

From the clinical point of view, future work should put more effort into establishing the set of ECG features and, especially, the relationship between these features and heart diseases. Additionally, because of the physiological characteristics of the heart, a single ECG wave may not accurately represent the entire condition of the heart; it is therefore desirable to obtain ECG signals from all 12 or 18 leads. For example, when an anterior wall myocardial infarction occurs, the ST-segment elevation feature changes reciprocally on the ECGs from leads I, aVL, and V1–V5. Therefore, the general application of the VAE model to such clinical situations warrants further study.