Comparison of Siamese trackers over OTB2013 and OTB2015 benchmarks.
Recently, Siamese neural networks have been widely used in visual object tracking to leverage the template matching mechanism. A Siamese network architecture contains two parallel streams that estimate the similarity between two inputs and can learn their discriminative features. Various deep Siamese-based tracking frameworks have been proposed to estimate the similarity between the target and the search region. In this chapter, we categorize deep Siamese networks into three categories by the position of the merging layers: late merge, intermediate merge, and early merge architectures. In the late merge architecture, inputs are processed as two separate streams and merged at the end of the network. In the intermediate merge architecture, inputs are initially processed separately and merged well before the final layer. In the early merge architecture, inputs are combined at the start of the network and a unified data stream is processed by a single convolutional neural network. We evaluate the performance of deep Siamese trackers based on their merge architectures and their outputs, such as similarity score, response map, and bounding box, in various tracking challenges. This chapter gives an overview of recent developments in deep Siamese trackers and provides insights for new developments in the tracking field.
- Siamese networks
- visual object tracking
- deep learning
- neural network
- end-to-end learning
In the past few decades, visual object tracking (VOT) has become a promising and attractive research field in computer vision. It is popular among researchers because of its wide range of applications, including autonomous vehicles [1, 2], surveillance and security [3, 4], traffic flow monitoring [5, 6], human computer interaction [7, 8], and many more, and because of the various tracking challenges and opportunities it offers. In recent years, researchers have developed a number of state-of-the-art trackers to handle these challenges. Although significant progress has been made, trackers have not yet achieved consummate performance, and VOT remains an open problem. The challenges to be handled by VOT include fast motion, motion blur, occlusion, deformation, illumination variations, background clutter, in-plane and out-of-plane rotations, out-of-view, low resolution, and scale variations.
The objective of VOT is to identify a region of interest in video frames. VOT consists of four sequential components: target initialization, target appearance modeling, motion estimation, and target localization. In target initialization, the region of interest is annotated using any of the representations including ellipse, centroid, object silhouette, object skeleton, object contour, or object bounding box. In generic object tracking, the position of the target is given in the first frame of a video and the tracking algorithm predicts the target location in the rest of the frames. The target appearance model combines a feature representation of the target with a mathematical model that identifies the region of interest using learning methodologies. The target motion estimation module predicts the position of the target in subsequent frames by either greedy search or maximum posterior prediction. The tracking problem is simplified by the constraints applied over the target appearance model and motion estimation. During tracking, both appearance and motion models are updated to capture the new target appearance and its behavior.
In this chapter, we focus on monocular, causal, model-free, short-term, and single-target trackers. Causality means that a tracker estimates the target location in the current frame without any information from future frames. The model-free characteristic stands for supervised learning where the target bounding box is given in the first frame of the video. Finally, short-term denotes that during tracking, a tracker is unable to re-detect the target once it is lost.
The performance of the trackers is highly affected by the feature representations. Features are broadly classified into hand-crafted (HC) and deep features. Traditional features such as histogram of oriented gradients (HOG), local binary patterns (LBP), color names, and scale-invariant feature transform are known as HC features. Nowadays, computer vision researchers are selecting deep features for better representation. Compared to HC features, deep features are better able to capture multi-level information and to encode variations in the target appearance. Deep features are extracted using different methods such as convolutional neural networks (CNN), recurrent neural networks (RNN), auto-encoders, residual networks, and generative adversarial networks (GAN) for different computer vision applications.
In recent years, CNN-based methods have been adopted in various computer vision tasks and gained popularity due to improved performance in face verification, image classification, semantic segmentation, medical image segmentation, object detection, etc. An empirical and comprehensive study performed by Fiaz et al. showed that deep trackers achieve better performance than HC feature-based trackers. The discriminative power of state-of-the-art deep trackers is explored by employing deep features. It is difficult to train a discriminative deep tracker efficiently because deep models are data-hungry. Various deep trackers handle the scarce training data problem by employing shallow features extracted from pre-trained off-the-shelf models such as AlexNet, VGGNet, etc. Nevertheless, these approaches do not fully benefit from end-to-end learning. Deep trackers that fine-tune multiple layers of the network with stochastic gradient descent (SGD) methods are not real-time because fine-tuning takes a lot of time.
In order to handle those restrictions, a simple approach known as the Siamese network is advocated to compute the similarity between two input images. Siamese networks are trained offline to learn the similarity between two input images and are evaluated online without fine-tuning for new target estimation. In this chapter, we study different types of Siamese networks developed for tracking. We also present an experimental study to analyze the performance of the Siamese trackers over the OTB2013 and OTB2015 benchmarks.
2. Related work
In the literature, there exist many comprehensive studies on VOT. Each study focuses on specific research aspects going on in the field. Fiaz et al.  classified the tracking algorithms into correlation and noncorrelation filter-based trackers. An extensive experimental study was performed over hand-crafted and deep feature trackers. Similarly, Li et al.  also studied the deep trackers and categorized deep trackers into three classes including network structure, network function, and network training. Leang et al.  discussed single target trackers while Zhang et al.  performed their study over the sparse trackers. Yang et al.  focused on the context information by considering auxiliary objects as the target context of the tracking object.
These studies reflect the tireless efforts of the research community, which has developed various state-of-the-art trackers. The tracking algorithms can be classified as tracking by detection, discriminative correlation filters, deep convolutional neural networks, and Siamese network-based trackers.
2.1 Tracking by detection-based trackers
In many tracking algorithms, classifiers such as support vector machine (SVM), random decision forest, and various boosting-based classifiers are the fundamental part that discriminates the target object from nontarget objects. In various tracking by detection algorithms, classifiers are updated during online learning to integrate the new target appearance. For example, the multiple instance learning framework proposed by Babenko et al. employed gradient boosting to learn the classifiers. Hare et al. utilized structured output to estimate the target location and employed SVM for online adaptive tracking. Zhang et al. applied Bayes classifiers for online adaptation of the target over a multi-scale feature space built on a data-dependent basis.
2.2 Discriminative correlation filter-based trackers
The development of trackers based on correlation filters has boosted the tracking performance. Bolme et al. proposed a fast tracker by minimizing the sum of squared error (SSE) between the actual output and the desired output in the frequency domain. Kernelized correlation filters (KCF) utilized multi-channel features using circulant matrices in the Fourier domain and used the Gaussian kernel function to discriminate a target from the background. Discriminative correlation filter trackers have their own limitations. For example, they require fixed model and patch sizes, and a model may learn undesired background information, resulting in reduced performance. SRDCF introduces a spatial regularization method in discriminative correlation trackers to reduce the effect of background information by penalizing it. SRDCFdecon, proposed by Danelljan et al., tackled the contaminated training samples to improve robustness. Li et al. proposed STRCF, which integrates temporal regularization in SRDCF using a passive-aggressive algorithm to improve the tracking performance. CSRDCF incorporates channel and spatial reliability within correlation filters. It integrates the spatial reliability using a spatial binary map at the target location, and the channel reliability by estimating the channel and detection reliability metrics.
2.3 Deep convolutional neural network-based trackers
Deep convolutional neural networks have presented an outstanding performance in many computer vision applications. Deep learning has limitations due to limited training data and high computational cost. However, much progress has been made and many state-of-the-art deep trackers have been proposed. Nam and Han employed CNN to develop a multi-domain adaptive deep tracker. Nam et al. integrated CNN in a tree structure to model the target appearance. A tree is constructed from multiple hierarchical CNN-based target appearances. Ma et al. exploited the rich hierarchical deep features using correlation filters. Qi et al. hedged the weak classifiers and obtained a strong classifier by taking advantage of multi-level deep features.
2.4 Template matching-based trackers
Tracking by matching is one of the most basic concepts in tracking where target pixels are directly compared with the input patches from the video. Briechle and Hanebeck  introduced the simplest template matching mechanism in tracking via a normalized cross-correlation. TLD-tracker  also employs normalized cross-correlation mechanism. Later on, many template matching trackers focused on distorted tracking objects. Wang et al.  performed matching using super-pixels. Nguyen and Smeulders  used color invariants to discriminate targets from the background. Godec et al.  employed HOG features for probabilistic matching. Held et al.  used deep regression networks for matching. Bertinetto et al.  exploited fully convolutional features to compute the correlation between the target and the search patches.
In this section, we noticed that various tracking algorithms have been proposed to solve the tracking problem, but the research area is still active. We also observed that there exist different comprehensive surveys that focus on various tracking frameworks. In contrast, we present a study on Siamese networks employed in tracking. We categorize the Siamese trackers into three categories and evaluate the robustness of the different Siamese trackers.
3. Siamese networks for tracking
In correlation filter-based trackers, a response map is computed between a target template and a candidate patch in the Fourier domain. In object tracking, the center of the target is focused and a weight matrix is trained such that it minimizes the squared error from the target. The tracking problem can be defined as a ridge regression problem, which has a closed-form solution, and is formulated as
$$\min_{w} \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \tag{1}$$
where $X$ holds the search space feature vectors, $w$ denotes the trained weights, $y$ is the desired output, $\lambda$ is a regularization parameter, and $\|\cdot\|_2$ means the ℓ2-norm of a vector. The solution for Eq. (1) is described as:
$$w = (X^{\top}X + \lambda I)^{-1} X^{\top} y \tag{2}$$
Since Eq. (2) has a high computational cost due to the matrix inversion, it cannot be used directly for tracking. Hence, the described problem can be resolved in the dual form as follows:
$$w = X^{\top}\alpha, \qquad \alpha = (XX^{\top} + \lambda I)^{-1} y \tag{3}$$
where α denotes the discriminatory part. For tracking problems, the challenge is to optimize α in the dual form solution in Eq. (3).
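As a numerical illustration of the primal and dual ridge regression solutions discussed above, the following minimal NumPy sketch computes both and checks that they give the same weights. The names `X`, `y`, and `lam`, as well as the random toy data, are assumptions made for illustration and are not taken from any tracker implementation.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Primal closed form: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_dual(X, y, lam):
    """Dual form: alpha = (X X^T + lam * I)^-1 y, then w = X^T alpha."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha, alpha

# Toy data (hypothetical): 5 samples with 8-dimensional features.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
y = rng.standard_normal(5)

w_primal = ridge_primal(X, y, lam=0.1)
w_dual, alpha = ridge_dual(X, y, lam=0.1)
print(np.allclose(w_primal, w_dual))
```

The dual form solves a system whose size equals the number of samples rather than the feature dimension, which is the cheaper option whenever the features outnumber the samples.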
An alternative approach is to learn a similarity function that compares the template image with the candidate image. A Siamese network architecture is a Y-shaped network that takes two images as inputs and returns their similarity as output. Siamese networks determine whether the two input images have identical patterns or not. The concept of Siamese networks was initially introduced for signature verification and fingerprint recognition, and later adapted to many computer vision applications such as large scale video classification, stereo matching, face recognition and verification, and patch matching. A series of state-of-the-art Siamese-based trackers have been proposed in the past few years. We observe that Siamese-based trackers utilize embedded features by employing CNN to compute the similarity. By analyzing the architecture of deep Siamese trackers, we classify them into three categories based on the position of the merge layer: (i) late merge, (ii) intermediate merge, and (iii) early merge architectures, as shown in Figure 1.
Late merge: the input images are processed separately by two individual parallel networks and are merged at the last layer of the network (Figure 1(a)).
Intermediate merge: the input images are processed separately in the initial part of the network and then merged well before the final layer (Figure 1(b)).
Early merge: the input images are stacked before feeding to the network and then a unified input is fed forward to the network for inference (Figure 1(c)).
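The three merge positions listed above can be contrasted with a toy NumPy sketch. Everything here is hypothetical: the layers are random linear maps with ReLU rather than convolutional layers, the layer sizes are arbitrary, and the two branches share weights for brevity, which is an assumption and not a property of every tracker discussed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(in_dim, out_dim):
    """Toy layer: a fixed random linear map followed by ReLU."""
    weights = rng.standard_normal((in_dim, out_dim)) * 0.1
    def layer(v):
        return np.maximum(v @ weights, 0.0)
    return layer

D = 16  # length of each flattened toy input (hypothetical)
template = rng.standard_normal(D)
search = rng.standard_normal(D)

# Late merge: each input runs through a complete stream,
# and the two embeddings are merged only at the very end.
stream_1, stream_2 = make_layer(D, 8), make_layer(8, 4)
def embed(v):
    return stream_2(stream_1(v))
late_score = float(embed(template) @ embed(search))

# Intermediate merge: separate early layers, then the fused
# features pass through shared layers before the output.
branch = make_layer(D, 8)
shared_1, shared_2 = make_layer(16, 8), make_layer(8, 4)
fused = np.concatenate([branch(template), branch(search)])
intermediate_out = shared_2(shared_1(fused))

# Early merge: the inputs are stacked first and a single
# network processes the unified input.
net_1, net_2 = make_layer(2 * D, 8), make_layer(8, 2)
stacked = np.concatenate([template, search])
early_out = net_2(net_1(stacked))

print(late_score, intermediate_out.shape, early_out.shape)
```

The only difference between the three variants is where the two inputs first meet: at the output, between the layers, or before the first layer.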
We also observe that Siamese-based trackers produce different types of output such as similarity score, response map, and bounding box. Trackers with a similarity score as output return the similarity as a probability measure, whereas a response map is a two-dimensional map of similarity scores. The maximum value in the response map marks the location of maximum similarity between the two patches, and low values mark dissimilar regions. Some Siamese-based trackers directly yield the bounding box location of the target.
3.1 Siamese late merge trackers
This subsection studies the tracker where the two input images are fed forward to two separate CNN models and are merged at the final layer to get the final response.
Siamese instance search tracker (SINT) was proposed by Tao et al. SINT learns an offline matching function and estimates the best-matched patch for incoming frames in a video (Figure 2). The architecture of SINT consists of two streams: a query stream and a search stream. Each stream is composed of 5 convolutional layers, 3 region-of-interest pooling layers, and 1 fully connected layer. The query and search streams are merged by a matching function that is trained with a contrastive loss function. The matching function is responsible for differentiating the target from the background information. SINT is trained offline by giving the template patch to the query stream and candidate patches to the search stream. During tracking, SINT does not update its weight parameters, and the template patch at the query stream is matched with the candidate patches at the search stream for each incoming frame. SINT selects the best-matched patch based on the maximum score. A ridge bounding box regression is employed to refine the bounding box.
Siamese fully convolutional network (SiameseFC) proposed by Bertinetto et al.  addresses the general similarity learning between the target image and search image as shown in Figure 3. During training, SiameseFC exploits the deep features using embedding functions and learns the similarity between the two images. During tracking, SiameseFC takes two images and infers a response map. The new target position is estimated at the maximum value on the response map where input images have the maximum similarity.
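The step from embedded features to a response map and a new target position can be sketched as a sliding-window cross-correlation. This is a minimal NumPy illustration, not the SiameseFC implementation: the feature maps are random toy arrays standing in for the outputs of an embedding function, and the function name `cross_correlation_map` and all shapes are hypothetical.

```python
import numpy as np

def cross_correlation_map(template_feat, search_feat):
    """Slide template features (h, w, c) over search features (H, W, c)."""
    h, w, _ = template_feat.shape
    H, W, _ = search_feat.shape
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[i:i + h, j:j + w, :]
            response[i, j] = np.sum(window * template_feat)
    return response

# Toy features (hypothetical): a weak random background with the
# template pattern planted at row 5, column 3 of the search features.
rng = np.random.default_rng(0)
template_feat = rng.standard_normal((4, 4, 4))
search_feat = rng.standard_normal((12, 12, 4)) * 0.1
search_feat[5:9, 3:7, :] = template_feat

response = cross_correlation_map(template_feat, search_feat)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(response.shape, peak)
```

The position of the maximum in `response` is where the search features are most similar to the template, which is how a response-map tracker reads off the new target location.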
Valmadre et al. proposed the correlation filter network (CFNet) by adding two layers, a correlation filter layer and a crop layer, to the SiameseFC template branch, which makes the network shallower but still efficient. While SiameseFC learns unconstrained features to estimate the similarity score, CFNet learns discriminative features using the correlation filter layer and solves the ridge regression problem by exploiting the negative samples in the search region. Similar to SiameseFC, CFNet is trained offline and its weight parameters are fixed during tracking. CFNet produces a response map for the template and search region in which a high value represents the maximum similarity.
Li et al. proposed a Siamese region proposal network (SIAMRPN) in order to improve the robustness compared to SiameseFC and CFNet. Neither SiameseFC nor CFNet employs bounding box regression, and thus both require multi-scale testing. SIAMRPN integrates a region proposal network (RPN) within SiameseFC, which makes it more elegant. The concept of RPN was introduced in Faster RCNN. RPN can extract more precise and efficient proposals due to the supervision of bounding box regression and a binary classifier.
SIAMRPN consists of two components including Siamese network and RPN as shown in Figure 4. Siamese network is responsible for feature computation. Its template branch takes
The loss function is
$$L = L_{cls} + \lambda L_{reg}$$
where $L_{cls}$ represents the classification loss, which is a cross-entropy loss function, $L_{reg}$ is the bounding box regression loss, and λ is a balancing parameter.
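A loss that balances a cross-entropy classification term against a bounding box regression term with a parameter λ can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the text only specifies a cross-entropy classification loss and a regression loss, so the smooth L1 form of the regression term, the function names, and the default `lam=1.0` are all assumptions rather than details confirmed by the source.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits (N, 2), integer labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def smooth_l1(pred, target):
    """Smooth L1 over box offsets (N, 4); assumed form of the regression loss."""
    diff = np.abs(pred - target)
    per_elem = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return float(per_elem.sum(axis=1).mean())

def combined_loss(logits, labels, box_pred, box_target, lam=1.0):
    """Classification loss plus lam times the regression loss."""
    return cross_entropy(logits, labels) + lam * smooth_l1(box_pred, box_target)

# Toy usage (hypothetical values): three anchors, foreground/background logits.
logits = np.array([[2.0, -1.0], [0.5, 1.5], [0.0, 0.0]])
labels = np.array([0, 1, 0])
box_pred = np.zeros((3, 4))
box_target = np.full((3, 4), 0.5)
print(combined_loss(logits, labels, box_pred, box_target, lam=1.0))
```

Raising `lam` puts more weight on box accuracy relative to the foreground/background decision.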
3.2 Siamese intermediate merge trackers
This subsection describes the tracking models where the two input images are fed separately to the network and are merged somewhere before the final layer of the CNN.
Held et al. proposed generic object tracking using regression network (GOTURN) and exploited the target appearance and motion relationships. GOTURN predicts the new target location in the current frame by taking the template image from the previous frame. Both input images are cropped with the background region for prediction, as demonstrated in Figure 5. GOTURN consists of two streams of 5 convolutional layers, one for the template image and one for the search image. The template and search streams are fused and fed forward to three shared fully connected layers. During tracking, GOTURN directly regresses the target position and does not update the weight parameters to adapt to new target appearances.
Chen and Tao proposed the YCNN tracker to estimate the similarity between two input images. The YCNN model consists of two separate streams of 3 convolutional layers each and two shared fully connected layers. The target object and search images are fed forward through the two separate convolutional streams and then merged before being forwarded to the two shared fully connected layers. The output of YCNN is a response map. The network is trained end-to-end using a Gaussian map as a label with the maximum value at the center. During tracking, the maximum position on the confidence map gives the new target position. The drift problem is handled by averaging the maximum five confidence values, while the scale problem is tackled by repeating the inference with different template sizes.
Huang et al. proposed the early stopping tracker (EAST) to exploit the similarity between the two input images and learn different policies by employing reinforcement learning (RL) to improve the accuracy while maintaining high speed. In contrast to SiameseFC, EAST infers the new target position in a single evaluation on the original template size. The tracking problem is formulated as a Markov decision process. The agent is trained offline such that, for each frame, it decides whether the target object has high enough confidence at the early layers or whether to go deeper by processing subsequent layers to obtain the maximum confidence. The agent makes this decision at each layer based on an early stopping criterion.
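The Gaussian label and the read-out of the confidence map described for YCNN can be sketched in NumPy. The map size, the value of `sigma`, and the function names are hypothetical. The source does not spell out how the five maximum confidence values are averaged, so averaging the coordinates of the five highest confidences is only one possible reading.

```python
import numpy as np

def gaussian_label(size, sigma):
    """Square label map with a Gaussian peak of value 1 at the center."""
    coords = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def top_k_position(conf_map, k=5):
    """Average the coordinates of the k highest confidences (one possible reading)."""
    flat_idx = np.argsort(conf_map, axis=None)[-k:]
    rows, cols = np.unravel_index(flat_idx, conf_map.shape)
    return float(rows.mean()), float(cols.mean())

label = gaussian_label(size=11, sigma=2.0)
center = tuple(int(v) for v in np.unravel_index(np.argmax(label), label.shape))
smoothed = top_k_position(label, k=5)
print(center, smoothed)
```

Averaging over several strong responses makes the estimate less sensitive to a single spurious peak than taking the argmax alone.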
3.3 Siamese early merge trackers
In this subsection, we study the tracking models where the input images are aggregated or stacked before feeding to the network.
Fiaz et al. proposed CNN with structural input (CNNSI) to exploit deep discriminative features to learn the similarity between the target and candidate patches, as shown in Figure 6. The target and candidate images are stacked together and fed forward to the network to get the similarity and dissimilarity scores. CNNSI is trained offline end-to-end using the SGD method to learn the similarity. During tracking, target and candidate patches are stacked and fed to the network to get similarity and dissimilarity scores for all the candidate patches. The maximum similarity score yields the new target position. The bounding boxes are refined using a bounding box regressor which is trained on the first frame of the sequence. Short-term and long-term updates are performed to integrate the new target appearance.
Leal-Taixé et al. presented a Siamese CNN (SiameseCNN) for pedestrian tracking to exploit the pedestrian appearance and geometrical position. The proposed network takes a stack of two target images along with their optical flow, which is forwarded to three CNN layers and three fully connected layers. A gradient boosting classifier is trained to predict the final trajectory of the pedestrian. For negative samples, contextual features along with relative geometry are provided to train the classifier. To infer the pedestrian, the gradient boosting classifier makes the final decision based on the maximum score.
4. Experimental analysis
This section discusses the experimental results and analysis over the OTB2013 and OTB2015 benchmarks. OTB2013 consists of 50 different sequences covering 11 challenges: fast motion (FM), background clutter (BC), motion blur (MB), low resolution (LR), scale variation (SV), in-plane rotation (IPR), out-of-plane rotation (OPR), deformation (DEF), occlusion (OCC), illumination variation (IV), and out-of-view (OV). OTB2015 is an improved version of OTB2013 that contains 100 videos and covers all the challenges from OTB2013.
The Siamese trackers are evaluated using precision, success, and speed measures. One pass evaluation (OPE) is utilized to evaluate the robustness of the Siamese trackers. Performance of the trackers is illustrated using precision and success graphs. To compute the precision, the Euclidean distance between the ground-truth center and the predicted center is calculated as:
$$\delta = \sqrt{(x_g - x_p)^2 + (y_g - y_p)^2}$$
where $(x_g, y_g)$ and $(x_p, y_p)$ denote the ground-truth center and the predicted center in a frame, respectively. A frame is counted as successful if $\delta$ is less than a threshold, which is set to 20 pixels. The target changes its size during a sequence, but precision only considers the pixel distance between the target centers, so it does not give a true picture of the target shape. Hence, a more robust success metric is employed for evaluation of trackers. An overlap score (OS) is calculated between the ground-truth and predicted bounding boxes to compute success as:
$$OS = \frac{|B_g \cap B_p|}{|B_g \cup B_p|}$$
where $B_g$ represents the ground-truth bounding box, $B_p$ denotes the predicted bounding box, $|\cdot|$ is the number of pixels, $\cap$ is the intersection, and $\cup$ is the union operator. The OS determines whether a frame is successful: if the OS is greater than a threshold, the frame is counted as successful, and otherwise it is not. The overlap score varies between 0 and 1, and the threshold is set at 0.5. For precision and success, average precision and average success scores are reported by computing the mean of precision and OS for all the frames in a benchmark, respectively. The speed of the Siamese trackers is reported in frames per second (FPS) by computing the mean of speed for all the frames in a benchmark.
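The two evaluation measures can be written out as a short sketch: the center distance used for precision and the overlap score used for success. The `(x, y, width, height)` box format and the function names are assumptions made for illustration, while the thresholds of 20 pixels and 0.5 follow the text.

```python
import numpy as np

def center_error(gt_center, pred_center):
    """Euclidean distance between ground-truth and predicted centers (pixels)."""
    dx = gt_center[0] - pred_center[0]
    dy = gt_center[1] - pred_center[1]
    return float(np.hypot(dx, dy))

def overlap_score(gt_box, pred_box):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    gx, gy, gw, gh = gt_box
    px, py, pw, ph = pred_box
    inter_w = max(0.0, min(gx + gw, px + pw) - max(gx, px))
    inter_h = max(0.0, min(gy + gh, py + ph) - max(gy, py))
    inter = inter_w * inter_h
    union = gw * gh + pw * ph - inter
    return inter / union if union > 0 else 0.0

def precision_rate(errors, threshold=20.0):
    """Fraction of frames whose center error is below the threshold."""
    return float(np.mean(np.asarray(errors) < threshold))

def success_rate(overlaps, threshold=0.5):
    """Fraction of frames whose overlap score exceeds the threshold."""
    return float(np.mean(np.asarray(overlaps) > threshold))

# Toy usage with hypothetical per-frame values.
print(center_error((0, 0), (3, 4)))
print(overlap_score((0, 0, 10, 10), (5, 0, 10, 10)))
print(precision_rate([5, 25, 10, 30]), success_rate([0.6, 0.4, 0.9, 0.5]))
```

A tracker can score well on precision while scoring poorly on success, for example when it keeps the center correct but misjudges the target scale, which is why both measures are reported.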
For comparison of different Siamese architectures, we selected Siamese trackers such that at least one tracker comes from each category. The selected trackers are SINT, SiameseFC, CFNet, SIAMRPN, GOTURN, and CNNSI. All results are taken from the original authors except those of GOTURN, because its authors did not report results over the selected benchmarks.
4.1 Quantitative evaluation
In this subsection, we discuss the quantitative comparison of Siamese Trackers.
4.1.1 Overall performance
Figures 7 and 8 and Table 1 show the precision and success of selected Siamese trackers over OTB2013 and OTB2015 respectively. The precision and success graphs show that SIAMRPN achieved outstanding performance compared to the other trackers. We also observe that the rank of the trackers does not change with respect to precision and success for both benchmarks. GOTURN does not perform well as compared to the other Siamese trackers.
4.1.2 Challenge-based evaluation
We also evaluated the performance of Siamese trackers for eleven different tracking challenges over the OTB2015 benchmark. Figures 9 and 10 and Tables 2 and 3 show the performance of Siamese trackers using precision and success, respectively. We observe that SIAMRPN attained the best performance for all the tracking challenges in both precision and success, while GOTURN does not show good performance and ranked last. SiameseFC ranked second after SIAMRPN for the fast motion and low-resolution challenges, while SINT ranked second for the rest of the challenges.
4.2 Qualitative evaluation
A qualitative study of Siamese-based trackers was performed over five different videos including
4.3 Speed analysis
We also report the speed of the trackers in frames per second (fps), as shown in Figure 12. We observe that GOTURN is computationally efficient and tracks objects at a speed of 165 fps. Similarly, SIAMRPN is also computationally efficient and can track at 160 fps. Although SiameseFC and CFNet have a higher computational cost than GOTURN and SIAMRPN, they still manage to track at high speed. However, SINT (4 fps) and CNNSI (0.53 fps) have very low speed and a high computational cost.
5. Summary of Siamese networks comparison
We study three different types of Siamese network architectures employed in visual tracking. We observe that all the Siamese trackers exploit the discriminative ability of deep CNN features. The experimental study indicates that the late merge technique performs better than the others. Table 4 shows the characteristics of the different Siamese network architectures.
| Characteristic | Late merge | Intermediate merge | Early merge |
| --- | --- | --- | --- |
| Definition | Inputs are combined at the final layer | Inputs are combined well before the final layer | Inputs are stacked before feeding the network |
| Trackers | SiameseFC, CFNet, SINT, SIAMRPN | GOTURN, YCNN, EAST | CNNSI, SiameseCNN |
| Output (bounding box/score map/scores) | All | All | Scores |
| Features exploitation | Exploits the input images separately, which is more discriminative | Initially exploits the input image features separately and then exploits the fused features, which reduces the discriminative ability | Inputs are merged and then processed, which reduces the discriminative ability of deep CNN features |
| Performance (precision and success) | Efficient | Moderate | Moderate |
6. Conclusions and future directions
In this chapter, we study Siamese networks and their different variants for the task of visual object tracking. Siamese networks are classified into three categories based on their architecture: late merge, intermediate merge, and early merge. We observe that late merge Siamese trackers have shown better performance compared to the other trackers. Our study concludes that SIAMRPN has shown outstanding performance and ranked the best among the selected Siamese trackers. The tracking performance of the Siamese trackers can be improved by integrating both spatial and temporal information. We observe that almost none of the Siamese networks perform an online model update. It remains a great challenge to update the model during tracking while maintaining the robustness of the Siamese trackers. Other deep architectures such as RNNs, residual networks, and GANs can be exploited within the Siamese networks to improve the tracking performance. Zero-shot and one-shot learning are becoming popular because of the limited data issue. Integration of zero-shot and one-shot learning with Siamese trackers is yet to be explored in the visual object tracking field.
This research was supported by Development project of leading technology for future vehicle of the business of Daegu metropolitan city (No. 20190405).