Toward Deep Learning-Based Human Target Analysis

In this chapter, we describe methods toward deep learning-based human target analysis. Firstly, human target analysis in the 2D and 3D domains of the radar signal is introduced. Furthermore, the range-Doppler surface for human target analysis using ultra-wideband radar is described. The construction of the range-Doppler surface involves range-Doppler imaging, adaptive threshold detection, and isosurface extraction. In comparison with micro-Doppler profiles and high-resolution range profiles, the range-Doppler surface contains range, Doppler, and time information simultaneously. An ellipsoid-based human motion model is designed for validation. Range-Doppler surfaces simulated for different human activities are demonstrated and discussed. With the rapid emergence of deep learning, the development of radar target recognition has been accelerated. We describe several deep learning algorithms for human target analysis. Finally, a few future research considerations are listed to spark inspiration.


Introduction
Human target analysis is acknowledged to be useful for a wide range of security and safety applications, such as through-wall detection and ground surveillance [1,2]. The analysis has usually been conducted in the time-frequency (micro-Doppler) and time-range (high-resolution range profile) domains. Chen first introduced the micro-Doppler concept to the radar community in [3]. Time-frequency transforms, such as the short-time Fourier transform, are used to analyze target Doppler signatures in slow time. Since then, many studies have investigated micro-Doppler-based target feature analysis [4][5][6][7][8][9]. The high-resolution range profile (HRRP) of human targets has also been studied [10]. The micro-Doppler profile and HRRP, which are both generated by micromotions, share a shortcoming: each contains information from only one domain, either time-frequency or time-range. Micro-Doppler analysis neglects range information, while HRRP analysis neglects Doppler information.
Therefore, in order to analyze the target signature more comprehensively, we describe a new concept called the range-Doppler surface (RDS). As an alternative to the micro-Doppler profile and HRRP, the RDS is a radar target representation extracted from a three-dimensional data cube, the range-Doppler (RD) video sequence [11,12]. The RDS contains all the important information found in both HRRP and micro-Doppler signatures. The present study analyzes the RDS using simulated and real radar data. The results suggest a new direction for human target analysis and classification.
It is worth mentioning that the term "range-Doppler surface" has appeared in prior works [13][14][15], where it denoted a 2D range-Doppler image shown in a 3D perspective. In this chapter, we use the term to describe the time-varying range-Doppler isosurface. This study is the first to use RDS for such a 3D visualization, and we consider it a suitable and precise term for the concept.
Deep learning has emerged as a major research topic in the past few years and has become a mainstream method for human activity recognition, replacing conventional machine learning methods. It works by learning several layers of representation that model the complex relationships among data. Through its hierarchical architecture, it builds high-level features from related low-level ones without manual feature extraction from the raw data and without specialized domain knowledge. In this way, it makes activity recognition systems more intelligent and versatile. Deep learning is therefore a suitable approach for identifying human activities.
This chapter is organized as follows: In Section 2, human target analysis with the micro-Doppler profile, HRRP, as well as the three-dimensional RD video sequence is described. In Section 3, the range-Doppler surface is described. Then, deep learning for human activity classification is introduced briefly in Section 4. Finally, future directions are drawn in Section 5.

Human target analysis with RD video sequence
The RD video sequence, which consists of N time-sampled 2D range-Doppler images, contains both spatial and temporal characteristics: range and frequency information is present in every RD image, while time information exists across sequential frames. Compared with the 1D and 2D forms of radar echoes, this joint time-range-frequency form contains all of the target's motion information.
Among the human target analysis systems that use the RD video sequence, a representative example so far is Google Project Soli, the first gesture recognition system capable of recognizing a rich set of dynamic gestures based on short-range radar [27,28]. It is based on an end-to-end trained combination of deep convolutional and recurrent neural networks, and its dataset is composed of 3D data cubes. Combining a CNN and an RNN can enhance the ability to recognize activities that vary in both the temporal and spatial dimensions. The system can recognize 10 kinds of subtle gestures performed by 10 people. Since then, much research has used radar for gesture sensing in the way that Kinect and Leap Motion are used in computer vision (CV) [21]. In addition, many novel ideas building on this system have been proposed [22,29,30].

Human target analysis with HRRP and micro-Doppler profiles
Although it contains abundant information about human activity, the 3D form of human backscattering echoes is complicated to process. As a result, systems using the 3D form of echoes are more complex than those using 2D forms. The 2D forms of radar echoes, mainly HRRP and micro-Doppler profiles, also carry enough human activity information and can be used for human target analysis.
A common way to obtain time-Doppler maps is joint time-frequency analysis (JTFA) [23,24]. Similar to the developments in other fields such as acoustics and speech processing, JTFA can provide additional insight into the analysis, interpretation, and processing of radar signals, and its performance is superior to what has been achieved in the traditional time or frequency domain alone [3].
The short-time Fourier transform (STFT) is the most commonly used time-frequency transform. The STFT performs the Fourier transform on a short sliding time window instead of applying one long window to the entire signal. The square modulus of the STFT is called the spectrogram, which is a nonnegative time-frequency energy distribution without phase information [20]:

$$\mathrm{Spectrogram}(t, f) = \left| \mathrm{STFT}(t, f) \right|^2 = \left| \int s(\tau)\, w(\tau - t)\, e^{-j 2\pi f \tau}\, d\tau \right|^2,$$

where $s(\tau)$ is the signal under analysis and $w(\cdot)$ is the window function. The resolution of the STFT is determined by the size of the window function: a longer window improves the frequency resolution at the cost of time resolution, and vice versa [4].
By performing an STFT over time for every range bin, a series of 2D time-Doppler images along range can be acquired. Then, by summing this time-Doppler "video" along range, a time-Doppler map is obtained.
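The per-range-bin STFT and the summation along range can be sketched as follows. This is a minimal NumPy illustration, not the processing chain used for the figures in this chapter: the Hann window, the window length, the hop size, the 200 Hz PRF, and the synthetic single-scatterer signal are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(x, win_len, hop):
    """Squared-modulus STFT of a complex slow-time signal (Hann window)."""
    window = np.hanning(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[s:s + win_len] * window for s in starts])
    # FFT of each frame; fftshift puts zero Doppler in the middle.
    stft = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    return np.abs(stft) ** 2          # shape: (n_frames, win_len)

def time_doppler_map(echoes, win_len=64, hop=16):
    """Sum the per-range-bin spectrograms along range.

    echoes: complex array of shape (n_range_bins, n_pulses).
    """
    return sum(spectrogram(row, win_len, hop) for row in echoes)

# Hypothetical test signal: one scatterer with a 25 Hz Doppler shift
# in range bin 3, sampled at an assumed PRF of 200 Hz.
prf, n_pulses, n_bins = 200.0, 512, 8
t = np.arange(n_pulses) / prf
echoes = np.zeros((n_bins, n_pulses), dtype=complex)
echoes[3] = np.exp(2j * np.pi * 25.0 * t)

tdm = time_doppler_map(echoes)
doppler_axis = np.fft.fftshift(np.fft.fftfreq(64, d=1 / prf))
peak_doppler = doppler_axis[tdm.sum(axis=0).argmax()]
print(tdm.shape, peak_doppler)
```

Changing `win_len` makes the trade-off visible: a longer window narrows the Doppler ridge but leaves fewer, coarser time frames.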
Among the 2D forms of echoes, the time-Doppler maps have so far been the most commonly used to analyze human targets. The time-Doppler maps include a wealth of Doppler information changing over time. The main Doppler shift is caused by the bulk speed, while micro-Doppler is produced by rotating or vibrating parts, such as the legs, feet, and hands. By selecting and classifying the micro-Doppler features in the time-Doppler map, human activities can be recognized by various models. For example, G. Klarenbeek et al. applied an LSTM structure to time-Doppler maps to realize multi-target human gait classification [26].
Time-range maps contain time-varying range information between the target and the radar. When a person is moving, different components of the human body have different relative distances from the radar at time t. Therefore, the time-range maps produced by different activities can be used to recognize the corresponding activities, although they neglect the Doppler information.

Range-Doppler processing
In general, separating the frequency components of different body parts is a vital step for human target analysis. However, the ability to resolve separate frequency components is limited because of the time-varying Doppler shifts in radar echoes. A general way of representing the scattered signals is range-Doppler (RD) processing. A classical way to perform RD processing is to apply the Fourier transform to samples from a fixed range bin over one coherent processing interval. This interval is theoretically limited by the time during which the target stays in the same range bin, that is, roughly the range cell size divided by the radial speed of the target.

The radar transmits a coherent burst of $M$ pulses:

$$s_T(t) = \sum_{m=0}^{M-1} p(t - mT_r)\, e^{j 2\pi f_c t}, \tag{3}$$

where $t$ is time, $p(t)$ is the pulse complex envelope, $T_r$ is the pulse repetition interval, and $f_c$ is the center frequency. The echo of a moving point scatterer can be described by a delayed and attenuated version of the transmitted signal in Eq. (3):

$$s_R(t) = \alpha\, s_T\big(t - \tau(t)\big), \tag{4}$$

where $\alpha$ is the complex amplitude and $\tau(t)$ is the round-trip time delay. Assuming that the radial velocity $v$ is constant during this period (taken here as positive for a receding target), the round-trip delay can be approximated by

$$\tau(t) \approx \tau_0 + \frac{2v}{c}\, t, \tag{5}$$

where $\tau_0 = 2R_0/c$ is the initial delay, $R_0$ is the initial range, and $c$ is the speed of light. Substituting Eqs. (3) and (5) into Eq. (4) leads to

$$s_R(t) = \alpha \sum_{m=0}^{M-1} p\!\left(t - mT_r - \tau_0 - \frac{2v}{c}\, t\right) e^{j 2\pi f_c \left(1 - \frac{2v}{c}\right) t}, \tag{6}$$

where all the constant terms have been absorbed and the variable name has been kept as $\alpha$.

Then, for notational convenience, the fast time $t' = t - mT_r$ is introduced. Substituting $t'$ into Eq. (6), after demodulation the baseband signal of the $m$-th pulse is given as

$$s(t', m) = \alpha\, p\!\left(t' - \tau_0 - \frac{2v}{c}\,(t' + mT_r)\right) e^{-j \frac{4\pi f_c v}{c} (t' + mT_r)}. \tag{7}$$

In radar, we usually assume that the target remains static in one pulse duration $T$:

$$|v|\, T \ll \Delta R = \frac{c}{2 B_w}, \tag{8}$$

where $\Delta R$ is the range resolution of the radar and $B_w$ is the signal bandwidth. This indicates that the Doppler effect in fast time is negligible. By simplifying, Eq. (7) can be rewritten as

$$s(t', t_s) = \alpha\, p\!\left(t' - \tau_0 - \frac{2v}{c}\, t_s\right) e^{-j \frac{4\pi f_c v}{c} t_s}, \tag{9}$$

where $t_s = mT_r$ is the slow time. The FT is performed over $t_s$ in Eq. (9), and the Doppler spectrum is given by

$$S(t', f) = \sum_{m=0}^{M-1} s(t', mT_r)\, e^{-j 2\pi f m T_r}, \tag{10}$$

where, as long as the envelope shift $2 v t_s / c$ stays within one range cell, the spectrum peaks at the Doppler frequency $f_d = -2 v f_c / c$. However, a target may sometimes traverse multiple range cells in one coherent processing interval. In this situation, conventional Fourier transform-based RD processing is not suitable, because it leads to a blur in frequency. To improve the Doppler resolution, a Keystone transform-based range-Doppler processing is proposed in this chapter.
By performing RD processing on the radar echoes, a sequence of RD images is obtained. The three-dimensional RD video is proposed to describe the slow-time evolution of target RD signatures.
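To make the structure of the RD video concrete, the sketch below builds one from a fast-time/slow-time data matrix using the conventional Fourier transform over slow time. It deliberately does not implement the Keystone transform, so it is only appropriate when the target stays within one range cell per interval. The function names, the Hann window, the interval length, and the synthetic scatterer are assumptions made for illustration.

```python
import numpy as np

def rd_image(cpi):
    """Range-Doppler image of one coherent processing interval.

    cpi: complex array (n_pulses, n_range_bins), slow time along axis 0.
    Returns magnitudes with shape (n_doppler, n_range_bins).
    """
    window = np.hanning(cpi.shape[0])[:, None]
    spectrum = np.fft.fft(cpi * window, axis=0)      # FT over slow time
    return np.abs(np.fft.fftshift(spectrum, axes=0))

def rd_video(data, cpi_len, hop):
    """Slide the interval along slow time: (frames, doppler, range) cube."""
    starts = range(0, data.shape[0] - cpi_len + 1, hop)
    return np.stack([rd_image(data[s:s + cpi_len]) for s in starts])

# Hypothetical scene: one point scatterer in range bin 10 with a constant
# 50 Hz Doppler shift, observed at an assumed PRF of 200 Hz.
prf, n_pulses, n_range = 200.0, 256, 40
t_s = np.arange(n_pulses) / prf
data = np.zeros((n_pulses, n_range), dtype=complex)
data[:, 10] = np.exp(2j * np.pi * 50.0 * t_s)

cube = rd_video(data, cpi_len=32, hop=16)
doppler_axis = np.fft.fftshift(np.fft.fftfreq(32, d=1 / prf))
d_idx, r_idx = np.unravel_index(cube[0].argmax(), cube[0].shape)
print(cube.shape, doppler_axis[d_idx], r_idx)
```

Each frame of `cube` is one RD image, and the frame index is the slow-time axis along which the target signature evolves.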

Range-Doppler surface construction
Before constructing the RDS, it is essential to detect the target in the range-Doppler domain, since detection allows the extraction of targets and the elimination of false alarms. The cell-averaging constant false alarm rate (CA-CFAR) procedure [16] is a classical approach for detecting a target in noise and clutter. Here, detection is performed with a two-dimensional CA-CFAR procedure [17] in the range-Doppler domain. For each range-Doppler image in the range-Doppler video sequence, a sliding 2D window scans the RD image pixel by pixel. A pixel is declared a detection if its intensity exceeds an estimated threshold. In Figure 1, a typical 2D window is shown. The cell under test covers the target reflections. The reference cells estimate the background noise used to compute the detection threshold. The guard cells act as a barrier that separates the cell under test from the reference cells. The sizes of these cells strongly affect the performance of the CA-CFAR detection and thus should be tuned carefully according to radar parameters and target characteristics (e.g., signal bandwidth, maximum unambiguous Doppler, and target velocity).
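The sliding-window detection described above can be sketched as follows. This is a minimal 2D CA-CFAR under stated assumptions: the guard and reference half-widths, the threshold multiplier `scale`, and the exponentially distributed test noise are illustrative values, not the settings used for the results in this chapter.

```python
import numpy as np

def ca_cfar_2d(power, guard=(1, 1), ref=(3, 3), scale=12.0):
    """Cell-averaging CFAR on a 2D power map.

    guard / ref: half-widths (rows, cols) of the guard band and of the
    reference band around it. scale: threshold multiplier on the mean
    reference power (an assumed value; it sets the false-alarm rate).
    Edge cells without a full window are left undetected.
    """
    gr, gc = guard
    hr, hc = gr + ref[0], gc + ref[1]          # half-size of full window
    n_rows, n_cols = power.shape
    detections = np.zeros_like(power, dtype=bool)
    n_ref = (2 * hr + 1) * (2 * hc + 1) - (2 * gr + 1) * (2 * gc + 1)
    for r in range(hr, n_rows - hr):
        for c in range(hc, n_cols - hc):
            window_sum = power[r - hr:r + hr + 1, c - hc:c + hc + 1].sum()
            guard_sum = power[r - gr:r + gr + 1, c - gc:c + gc + 1].sum()
            noise = (window_sum - guard_sum) / n_ref   # reference cells only
            detections[r, c] = power[r, c] > scale * noise
    return detections

# Hypothetical RD image: unit-mean exponential noise plus one strong cell.
rng = np.random.default_rng(0)
power = rng.exponential(1.0, size=(64, 64))
power[30, 40] += 200.0
detections = ca_cfar_2d(power)
print(detections[30, 40], int(detections.sum()))
```

Because the guard block (which includes the cell under test) is subtracted from the full window, the target's own energy does not inflate its noise estimate. Applying the function to every frame of an RD video gives the detected 3D volume from which the surface is extracted.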
In Figure 2a, the detected scatterers of a simulated human target are shown in a three-dimensional (3D) volume, where the intensities of different scatterers are represented by various colors. Note that the simulated radar system uses the same parameters as used in generating Figure 3. Finally, the RDS (see Figure 2b) is constructed by creating a surface that has the same intensity value within the 3D range-Doppler-time volume (i.e., range-Doppler video sequence) in Figure 2a. Isosurface plots are similar to contour plots in that they both indicate where values are equal. The MATLAB® function isosurface is applied to extract the isosurface from the volume using a user-defined isosurface threshold. The isosurface connects points that have the specified value much the way contour lines connect points of equal elevation. Note that the difference of the surface color in Figure 2b is not due to different intensities, but due to the lighting effect used to illustrate the 3D object in MATLAB®. Selecting a reasonable threshold is important in this procedure, because this affects the final output significantly. Although currently the threshold is set manually, automatic approaches to construct the volume surface are certainly interesting in future studies.
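As a rough illustration of what the isosurface threshold selects, the sketch below marks the voxels of a 3D volume that lie on the boundary of the region at or above a chosen level. It is a voxel-level approximation written for this example, not the triangulated mesh that MATLAB's isosurface returns; a mesh could be obtained in Python with a marching-cubes routine such as the one in scikit-image. The Gaussian test volume and the level are assumptions.

```python
import numpy as np

def isosurface_voxels(volume, level):
    """Boolean mask of voxels lying on the boundary of {volume >= level}.

    A voxel is kept if its value is >= level and at least one of its six
    face neighbours is < level (or it sits on the edge of the volume).
    """
    inside = volume >= level
    padded = np.pad(inside, 1, constant_values=False)
    interior = np.ones_like(inside)
    for axis in range(3):
        for shift in (-1, 1):
            # neighbour along this axis; wrapped values fall in the padding
            neighbour = np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
            interior &= neighbour
    return inside & ~interior

# Hypothetical range-Doppler-time volume: a smooth blob centred in the cube.
axis = np.arange(21) - 10
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
volume = np.exp(-(x ** 2 + y ** 2 + z ** 2) / (2 * 4.0 ** 2))

surface = isosurface_voxels(volume, level=0.5)
print(int(surface.sum()), int((volume >= 0.5).sum()))
```

Raising or lowering `level` shrinks or grows the extracted surface, which is why the choice of threshold affects the final output so strongly.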
Target analysis has commonly been done in the time-range domain or the time-frequency domain. As mentioned above, HRRP neglects Doppler information, while micro-Doppler neglects range information. Furthermore, micro-Doppler is difficult to use in multi-target situations, since the Doppler spectra of different targets may overlap. The RDS shows the target surface in the 3D range-Doppler-time space. All the important target information that might be contained in HRRP and micro-Doppler is included in the RDS.
Figure 4. The responses of the feet are well separated in either range or Doppler, and the responses of the thorax and hands overlap with each other. The feet have a larger Doppler offset than the thorax and hands.
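The relation between the 3D volume and the two conventional 2D representations can be illustrated with a toy example. The cube below is hypothetical: two targets are placed in the same Doppler bin but in different range bins, so that collapsing the range axis merges them while the full volume keeps them apart.

```python
import numpy as np

# Hypothetical RD video with axes (time, Doppler, range): two targets that
# share Doppler bin 10 but sit in range bins 5 and 25.
cube = np.zeros((20, 32, 40))
cube[:, 10, 5] = 1.0
cube[:, 10, 25] = 1.0

time_doppler = cube.sum(axis=2)   # collapse range: micro-Doppler-like map
time_range = cube.sum(axis=1)     # collapse Doppler: HRRP-like map

ridges_in_doppler = int((time_doppler[0] > 0).sum())   # targets overlap
ridges_in_range = int((time_range[0] > 0).sum())       # targets separate
blobs_in_cube = int((cube[0] > 0).sum())               # targets separate
print(ridges_in_doppler, ridges_in_range, blobs_in_cube)
```

Both 2D maps are projections of the same volume, which is why the volume, and the surface extracted from it, retains the information of either one.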

RDS of measured human activities
A PulsOn 400 radar system, manufactured by Time Domain Corporation [18], was used to acquire the measurement data (for the experimental setup, see Figure 5). Its operational frequency band is 3.1-5.3 GHz, and the signal is modulated by an m-sequence. The transmitted power is −14.5 dBm. The pulse repetition frequency is 200 Hz, and the sampling frequency is 16.39 GHz. More details about this radar are given in the literature [19].
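A few derived quantities follow directly from the parameters quoted above and show why range migration matters for this radar. The calculation below is a back-of-the-envelope sketch: the mid-band center frequency and the 1.5 m/s radial walking speed are assumptions, and the range resolution uses the standard relation c/(2B).

```python
C = 299_792_458.0              # speed of light, m/s

f_low, f_high = 3.1e9, 5.3e9   # operational band quoted above, Hz
prf = 200.0                    # pulse repetition frequency, Hz

bandwidth = f_high - f_low                    # 2.2 GHz
f_center = 0.5 * (f_low + f_high)             # 4.2 GHz (assumed mid-band)
wavelength = C / f_center

range_resolution = C / (2 * bandwidth)        # metres
max_doppler = prf / 2                         # unambiguous Doppler, +/- Hz
max_velocity = max_doppler * wavelength / 2   # unambiguous radial speed, m/s

walking_speed = 1.5                           # assumed radial speed, m/s
time_in_bin = range_resolution / walking_speed    # seconds in one range cell
pulses_in_bin = int(time_in_bin * prf)

print(f"range resolution: {range_resolution * 100:.1f} cm")
print(f"unambiguous radial speed: +/-{max_velocity:.2f} m/s")
print(f"time in one range cell: {time_in_bin * 1e3:.0f} ms ({pulses_in_bin} pulses)")
```

Under these assumptions a walking person crosses a range cell of about 7 cm within a handful of pulses, so any longer coherent processing interval spans several range cells.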
In the measurement, two scenarios were considered: single-person walking and two-people walking. The RDSs generated for these two scenarios are presented in Figures 6 and 7, respectively. The RDS for the single-person scenario is similar to the simulated RDS shown in Figure 2b, and the capability of RDS to separate body segments is demonstrated again in Figure 6d and e. It should be noted that in the processing of real data, static clutter is removed via moving target indication before constructing the RDS, which cancels the clutter and also the stationary parts of the human body. More precisely, in each walking step, the reflection from the stationary leg/foot is rejected, while the reflection of the moving leg/foot is retained.
In Figure 7, the RDS of the two-people scenario shows that the backscattering of the human targets is automatically separated in the 3D range-Doppler-time space. This indicates that RDS is not only able to show the range-Doppler signatures of a single extended target but also able to separate (or even track) multiple targets in the range-Doppler video sequence. Additional processing to separate multi-target reflections (e.g., the separating method proposed in [19]) is no longer required.
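The chapter does not specify which moving target indication filter was used. Two common choices are sketched here for illustration: a two-pulse canceller and slow-time mean removal. The synthetic data, with static clutter in one range bin and a moving scatterer in another, are assumptions.

```python
import numpy as np

def mti_two_pulse(data):
    """Two-pulse canceller: difference of consecutive pulses (slow time on axis 0)."""
    return data[1:] - data[:-1]

def mti_mean_removal(data):
    """Subtract the slow-time mean of every range bin."""
    return data - data.mean(axis=0, keepdims=True)

# Hypothetical data: static clutter in range bin 2, a scatterer with a
# 20 Hz Doppler shift in range bin 5, assumed PRF of 200 Hz.
prf, n_pulses, n_range = 200.0, 200, 8
t_s = np.arange(n_pulses) / prf
data = np.zeros((n_pulses, n_range), dtype=complex)
data[:, 2] = 10.0
data[:, 5] = np.exp(2j * np.pi * 20.0 * t_s)

cancelled = mti_two_pulse(data)
centred = mti_mean_removal(data)
print(np.abs(cancelled[:, 2]).max(), np.abs(cancelled[:, 5]).min())
```

Both filters null the zero-Doppler component. This also explains the remark above: a foot that is momentarily at rest has zero Doppler and is suppressed together with the clutter.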
As an example, the RDS has been demonstrated for human target analysis using an S-/C-band UWB radar, but RDS itself is in fact a generic tool. It can be used in various applications, such as feature extraction, tracking, or classification. Similar to micro-Doppler or high-resolution range profiles, the RDS has the potential of being used to analyze different types of targets (e.g., people or animals) and also used in different types of radars (e.g., S-band, C-band, or X-band radar).

Deep learning for human target analysis
With the rapid emergence of new deep learning algorithms and architectures, the development of many domains such as speech recognition, visual object recognition, object detection, and even drug discovery and genomics has been accelerated. Deep learning is composed of multiple processing layers to learn high-level representations with multiple levels of abstraction, thus automating the process of feature extraction. Hence, deep models do not need heavy feature engineering and domain knowledge compared with traditional machine learning techniques. What is more, with so many complicated deep-level transformations, very complex functions can be learned, and more classification and recognition problems can be solved. As a result, deep learning has made great contributions in overcoming difficulties in artificial intelligence and advancing the development of artificial intelligence.
Next, we describe several deep learning models that are most widely used in the field of human target analysis.

Convolutional neural network
The convolutional neural network (CNN) is inspired by the structure of the visual cortex, which is composed of simple cells and complex cells. It adopts four key ideas: local connections, parameter sharing, pooling, and multiple layers. In this way, a CNN can exploit the compositional hierarchies present in raw signals, extracting higher-level features from lower-level ones. As a result, convolutional neural networks, as one of the representative algorithms of deep learning, have made remarkable progress in object detection and recognition, natural language processing (NLP), speech recognition, and medical image analysis in the past few years. In the field of human activity recognition, the CNN is one of the most used deep learning models. For instance, Zhenyuan Zhang et al. adopted this network to realize continuous dynamic gesture recognition using a radar sensor [30], while Youngwook Kim et al. detected and classified human activities using deep convolutional neural networks [32].
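Three of these ideas (local connections, parameter sharing, and pooling) can be shown in a few lines of NumPy. The sketch below is a didactic forward pass with a hand-set kernel, not a trained network and not the architecture of any cited work; the toy image and the edge-detecting kernel are assumptions.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: one shared kernel slides over the image."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # local connection: each output sees only a kh x kw patch
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/cols that do not fit are dropped."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy "radar image" with a vertical edge, and a kernel that responds to it.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])

pooled = max_pool(relu(conv2d(image, kernel)))
print(pooled)
```

The same four kernel weights are reused at every position (parameter sharing), and pooling keeps the strongest response in each neighbourhood. Stacking such stages gives the multilayer hierarchy.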

Recurrent neural network
Following its successful application in NLP, the recurrent neural network (RNN) has caught researchers' attention. RNNs are well suited to modeling temporal sequences such as text and speech because they can capture the timing and semantic information in them. From the perspective of network structure, an RNN remembers previous information and uses it to influence the output of the following nodes. However, the conventional RNN has a limitation: it struggles to learn long-term dependencies. To overcome this problem, the long short-term memory (LSTM) network was introduced and has performed better in many tasks. An LSTM has three special gates: the input gate, the output gate, and the forget gate. By using these gated memory units, especially the forget gate, an LSTM can access long-range context in sequential data. Owing to these advantages, many human activity recognition systems have adopted the RNN and its variants. Zhi Zhou et al. used multimodal signals, including HRRPs and Doppler profiles acquired by a terahertz radar system, to recognize dynamic gestures, with a recognition rate of more than 91% [22].
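The role of the three gates can be seen in a single LSTM time step. The sketch below uses the common LSTM formulation with randomly initialized weights. The sizes, the gate ordering in the stacked weight matrices, and the input sequence are assumptions for illustration and do not correspond to any cited system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*H, D), U: (4*H, H), b: (4*H,), stacked as
    [input gate, forget gate, output gate, candidate].
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:])       # candidate cell update
    c = f * c_prev + i * g       # cell state carries long-range context
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H = 5, 3                                   # assumed feature and state sizes
W = 0.1 * rng.standard_normal((4 * H, D))
U = 0.1 * rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((10, D)):        # 10 hypothetical time steps
    h, c = lstm_step(x, h, c, W, U, b)

# With the forget gate saturated open and the input gate closed,
# the cell state is simply carried forward.
b_keep = np.concatenate([np.full(H, -20.0), np.full(H, 20.0), np.zeros(2 * H)])
c_prev = np.array([0.5, -1.0, 2.0])
_, c_kept = lstm_step(np.zeros(D), np.zeros(H), c_prev,
                      np.zeros((4 * H, D)), np.zeros((4 * H, H)), b_keep)
print(h, c_kept)
```

The second call shows the mechanism behind long-range memory: when the forget gate is close to one and the input gate close to zero, the cell state passes through a time step essentially unchanged.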

Auto-encoder
Auto-encoder is a high-performance deep learning network suitable for dealing with one-dimensional data by extracting optimized deep features. It learns a deep feature representation of raw input via several rounds of encoding-decoding procedures. Auto-encoder applies the layer-wise greedy unsupervised pre-training principle so as to quickly obtain an efficient deep network.
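The encoding-decoding idea can be sketched with a single-hidden-layer auto-encoder trained by plain gradient descent on synthetic one-dimensional profiles. This is a minimal end-to-end example, not the layer-wise greedy pre-training scheme mentioned above; the data, the layer sizes, the learning rate, and the number of steps are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D "profiles": 200 samples of length 8 that really live on a
# 2-dimensional subspace, so a 2-unit bottleneck can represent them.
latent = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 8)) / np.sqrt(2)
X = latent @ mixing

n_in, n_hidden, lr = X.shape[1], 2, 0.1
W1 = 0.1 * rng.standard_normal((n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_in))
b2 = np.zeros(n_in)

losses = []
for step in range(2000):
    h = np.tanh(X @ W1 + b1)            # encoder
    X_hat = h @ W2 + b2                 # decoder
    err = X_hat - X
    losses.append(float(np.mean(err ** 2)))

    # Backpropagation of the mean squared reconstruction error.
    d_out = 2.0 * err / err.size
    d_pre = (d_out @ W2.T) * (1.0 - h ** 2)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_pre)
    b1 -= lr * d_pre.sum(axis=0)

features = np.tanh(X @ W1 + b1)         # learned low-dimensional features
print(losses[0], losses[-1])
```

After training, `features` is the compact representation that would be handed to a classifier, which is how an auto-encoder serves as a feature extractor.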
The commonly used variants of the auto-encoder are mainly the following: (1) the sparse auto-encoder, which is able to rebuild the input data well, and (2) the de-noising auto-encoder and the contractive auto-encoder, which make the models more generic by adding noise or a well-chosen penalty term.
The auto-encoder provides a powerful feature extraction approach for many tasks, which saves a lot of labor. It can be combined with either conventional machine learning algorithms or other deep learning models to form a more robust system. Mehmet Saygın Seyfioğlu et al. [33] used a convolutional auto-encoder architecture to discriminate 12 indoor activity classes involving aided and unaided human motions by recognizing different 2D Doppler maps, and Branka Jokanovic et al. [34] applied three stacked auto-encoders to extract deep features separately and fused the results with a voting principle to classify activities.

Future directions
Although a great deal of research on human target analysis has been carried out with many kinds of deep learning methods, and with considerable success, there are still many challenges and opportunities. A few future research considerations are listed below.

Distinguish radar images from natural images
Among the three forms of backscattered radar signals mentioned above, 2D radar signals such as time-Doppler maps and time-range maps are the most used for recognition because they are represented in two dimensions and look more intuitive. Furthermore, the deep learning models applied to them are usually borrowed from the field of computer vision. In CV, the inputs are natural images, but radar images are not. This raises the question of whether it is appropriate to treat 2D radar images entirely as natural images. As a result, techniques that account for the differences between radar images and natural images are urgently needed.

Notice phase information
Common energy-based power spectrograms obtained after the FT or STFT always discard the phase information in backscattered echoes. However, phase is an important attribute of any signal and contains a wealth of information, such as transmission duration and distance. Pavlo Molchanov et al. investigated frequency and phase coupling phenomena in radar backscattered signals and proposed novel bicoherence-based information features [31]. We think phase information in radar backscattered signals should be considered more in future studies.
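A small numerical example shows what is lost when only the power spectrum is kept. Two copies of the same hypothetical echo that differ only by a delay have identical power spectra, while the delay remains recoverable from the phase. The signal, its length, and the delay are assumptions, and a circular shift is used so that the DFT identity holds exactly.

```python
import numpy as np

n, delay = 128, 5                         # hypothetical length and delay (samples)
rng = np.random.default_rng(0)
echo = rng.standard_normal(n)
delayed = np.roll(echo, delay)            # circular delay keeps the DFT identity exact

E, D = np.fft.fft(echo), np.fft.fft(delayed)
same_power = bool(np.allclose(np.abs(E) ** 2, np.abs(D) ** 2))

# The delay lives in the phase: D[k] = E[k] * exp(-2j*pi*k*delay/n).
# The circular cross-correlation uses that phase and peaks at the delay.
xcorr = np.fft.ifft(D * np.conj(E)).real
estimated_delay = int(np.argmax(xcorr))
print(same_power, estimated_delay)
```

A classifier fed only with power spectrograms cannot distinguish the two signals, whereas one that also sees phase, or features derived from it, can.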

Take orientation sensitivity into consideration
Doppler shift is caused by the radial velocity of the moving target. The radial velocity depends on the relative position of the target and the radar, because it is the component of the target's velocity along the line of sight. For example, when the radar is above the pedestrian, the Doppler is partly induced by the vertical component of the motion, such as vertical arm and leg motions. In this case, negative Doppler will appear. As a result, if the relative position is different, the radar backscattered signals produced by one subject performing a specified activity will differ considerably. How to overcome the orientation sensitivity of radar-based human activity recognition (HAR) is one of the future research topics.
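The dependence on geometry can be quantified with the line-of-sight projection. The sketch below computes the Doppler shift of a point target for a monostatic radar; the 4.2 GHz carrier, the positions, and the velocities are assumptions chosen to contrast a radar in front of the pedestrian with one directly overhead.

```python
import math

C = 299_792_458.0   # speed of light, m/s

def doppler_shift(velocity, radar_pos, target_pos, f_c):
    """Doppler shift (Hz) of a point target seen by a monostatic radar.

    Positive when the target approaches the radar. Vectors are (x, y, z)
    tuples in metres and metres per second.
    """
    los = [t - r for t, r in zip(target_pos, radar_pos)]   # radar -> target
    dist = math.sqrt(sum(d * d for d in los))
    # radial velocity: projection of the velocity on the line of sight
    v_radial = sum(v * d for v, d in zip(velocity, los)) / dist
    return -2.0 * v_radial * f_c / C

f_c = 4.2e9                      # assumed carrier frequency, Hz
target = (10.0, 0.0, 0.0)
walk = (1.5, 0.0, 0.0)           # torso walking away along +x
arm_up = (0.0, 0.0, 1.0)         # vertical arm motion

front = doppler_shift(walk, (0.0, 0.0, 0.0), target, f_c)
overhead = doppler_shift(walk, (10.0, 0.0, 5.0), target, f_c)
arm_front = doppler_shift(arm_up, (0.0, 0.0, 0.0), target, f_c)
arm_overhead = doppler_shift(arm_up, (10.0, 0.0, 5.0), target, f_c)
print(front, overhead, arm_front, arm_overhead)
```

The same walking motion produces a shift of roughly 42 Hz in magnitude for the frontal radar and none for the overhead one, while the vertical arm motion behaves the opposite way. This is the orientation sensitivity described above.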

Focus more on 1D and 3D domain radar echoes
A survey of the current state of research shows that, compared with the 2D domain, there are few results on the 1D and 3D domains of human echo signals. However, based on the discussion in the previous sections, we have reason to believe that these two forms of echoes have considerable potential and room for exploration. Thus, more attention should be paid to this part of the human target analysis field.