Trackers introduced in this chapter: T0, a part-based tracker without model update; T1, the part-based tracker with model update; T2, a KNN-based tracker with color and HOG features; T3, co-tracking of KNN-based classifier T2 and part-based detector T1; T4, active co-tracking of T1 and T2 with online update; T5, active asymmetric co-tracking of short-memory T1 and long-memory T2 (modified from ); and T6, active ensemble co-tracking of bagging-induced ensemble and long-memory T2 (modified from ).
Discriminative visual trackers have recently obtained state-of-the-art performance, yet they still suffer in the presence of real-world challenges such as target motion and appearance changes. In a discriminative tracker, one or more classifiers are employed to obtain the target/nontarget label for the samples, which in turn determine the target’s location. To cope with variations of the target shape and appearance, the classifier(s) are updated online with different samples of the target and the background. Sample selection, labeling, and updating the classifier are prone to various sources of error that cause the tracker to drift. In this study, we motivate, conceptualize, realize, and formalize a novel active co-tracking framework step by step, to demonstrate the challenges and generic solutions for them. In this framework, the classifiers not only cooperate in labeling the samples but also exchange their information to robustify the labeling, improve the sampling, and realize efficient yet effective updating. The proposed framework is evaluated against state-of-the-art trackers on a public dataset and shows promising results.
- visual tracking
- active learning
- active co-tracking
- uncertainty sampling
Visual tracking is one of the building blocks of human-robot interaction. Implicitly or explicitly, this task is embedded in many high-level, complicated tasks of the robot: automating industrial workcells, attending the speaker in a multimodal spoken dialog system, following the target and vision-based robot navigation, aerial visual servoing, imitating the behavior of a human, extracting tacit information of an interaction, sign-language interpretation, and autonomous driving, as well as simpler tasks such as human-robot cooperation, obstacle avoidance, first-person view action recognition, and human-computer interfaces.
The most general type of tracking is single-object model-free online tracking, in which the object is annotated in the first frame and tracked in the subsequent frames with no prior knowledge about the target’s appearance, its motions, the background, the configurations of the camera, and other conditions of the scene. Visual tracking is still considered a challenging problem despite numerous efforts made to address abrupt appearance changes of the target, complex transformations and deformations [15, 16], background clutter, occlusion, and motion artifacts.
Generative trackers attempt to construct a robust object appearance model or to learn it on the fly using advanced machine learning techniques such as subspace learning, hash learning, dictionary learning, and sparse code learning. General object tracking is the task of tracking arbitrary objects through one-shot learning, typically with no a priori knowledge about the target’s geometry, category, or appearance. Called model-free tracking, the task is to learn the target appearance and update it by adjusting to the target’s changes on the fly. To this end, discriminative models focus on target/background separation using correlation filters [23, 24, 25] or dedicated classifiers, which helps them dominate the visual tracking benchmarks [27, 28, 29]. Tracking-by-detection approaches have become a popular trend in recent years, owing to significant breakthroughs in the object detection domain (deep residual neural networks, for instance) that yield strong discriminating power with offline training. Adopted for visual tracking, many such trackers are adjusted for online training and accumulate knowledge about the target with each successful detection (e.g., [26, 31, 32, 33]).
Tracking-by-detection methods primarily treat tracking as a detection problem to avoid having to model object dynamics, especially in the case of sudden motion changes, extreme deformations, and occlusions [34, 35]. However, the tracking-by-detection setting has a multitude of drawbacks:
- Label noise: inaccurate labels confuse the classifier and degrade the classification accuracy. The labeler is typically built upon heuristics and intuitions, rather than using the accumulated knowledge about the target.
- Self-learning loop: the classifier is retrained with its own output from earlier frames, thus accumulating error over time.
- Uniform treatment of samples: all samples receive equal weight in evaluating the target and training the classifier, despite the uneven contextual information in different samples. The classifier is trained using all the examples with equal weights, meaning that negative examples that overlap very little with the target bounding box are treated the same as negative examples with significant overlap.
- Stationarity assumption: the assumption of a stationary distribution of the target appearance does not hold in most real-world scenarios, in which the target appearance changes drastically. In the context of visual tracking, non-stationarity means that the appearance of an object may change so significantly that a negative sample in the current frame looks more similar to a positive example in the previous frames.
- Model update difficulties: adaptive trackers inherently suffer from the drifting problem. Noisy model updates and the mismatch between the model update frequency and the target evolution rate are two major challenges of the model update. If the update rate is low, the changes of the target are not reflected in the target’s template, whereas rapid updating renders the tracker vulnerable to data noise and small target localization errors. This phenomenon is also known as the stability-plasticity dilemma.
In this study, we motivate, conceptualize, realize, and formalize a novel co-tracking framework. First, the importance of such a system is demonstrated by a recent and comprehensive literature review. Then a discriminative tracking framework is formalized and evolved into a co-tracking framework by explaining all the steps, mathematically and intuitively. We then construct various instances of the proposed co-tracking framework (Table 1) to demonstrate how different topologies of the system can be realized, how the information exchange is optimized, and how different challenges of tracking (e.g., abrupt motions, deformations, clutter) can be handled in the proposed framework. Active learning is explored in the context of labeling and information exchange in this co-tracking framework to speed up the tracker’s convergence while updating the tracker’s classifiers effectively. Dual memory is also proposed in the co-tracking framework to handle various tracking scenarios ranging from camera motions to temporal appearance changes of the target and occlusions.
It should be noted that preliminary results of this research were published in [40, 41]; however, the results presented here are slightly different because they use a different feature-based auxiliary classifier, different target estimations, and a different ROI-detection scheme (which is omitted here to preserve the flow of the progressive system design).
2. Tracking by detection
A typical tracking-by-detection method consists of five major steps: SAMPLING, CLASSIFYING, LABELING, ESTIMATING, and UPDATING.
SAMPLING: To obtain the positive and negative samples (the target and the background, respectively), dense or sparse (stochastic) sampling is performed either around the last known target position (using Gaussian distributions, particle filters, or various motion models) or around the saliencies or key points in the current frame. Adaptive weights for the samples based on their appearance similarity to the target, occlusion state, and spatial distance to the previous target location have been considered; in the context of tracking by detection in particular, boosting has been extensively investigated [45, 46, 47].
CLASSIFYING: The classification module of tracking-by-detection schemes utilizes offline-trained classifiers or online supervised learning methods to classify the target from its background (e.g., ). To robustify this module, especially against label noise, supervised learning with robust loss functions [46, 49] and semi-supervised [39, 50] and multi-instance [47, 51, 52] learning approaches are considered. Efficient sparse sampling, leveraging context information [17, 54], considering the information content of the samples for the classifier, and landmark-based label propagation are among the other approaches proposed to address this issue. Another interesting approach is to reformulate the problem so that the labeling and updating processes are coupled, bridging the gap between the objectives of these two steps, as labeling aims to predict binary sample labels, whereas updating typically tries to estimate the object location. The label noise problem is amplified when the tracker does not have a forgetting mechanism or a way to obtain external scaffolds (i.e., the self-learning loop). This inspired the use of co-tracking, ensemble tracking [56, 57], or label verification schemes to break the self-learning loop using auxiliary classifiers.
LABELING: The result of the classification process provides the target/background label for each sample, a process that can be enhanced by employing an ensemble of classifiers [56, 57], exchanging information between collaborative classifiers, and verifying labels by auxiliary classifiers or landmarks.
ESTIMATING: The state of the target, i.e., the location and scale of the target, usually described with a bounding box, is then determined by selecting the sample with the highest classification score, calculating the expectation of the target state, or performing bounding box regression.
UPDATING: Updating the classifier is another challenge of the tracking-by-detection schemes. Updating the classifier in a closed loop with the data it previously labeled itself (known as the self-learning loop) is susceptible to drift from the original data distribution, because a tiny error or a small amount of noise can be amplified. Therefore, along with many studies that revalidate the data labels (such as ), the importance of having a “teacher” to guide the classifier during training is discussed in the literature. Cooperative classifiers in frameworks such as ensembles of homogeneous or heterogeneous classifiers, co-learning, and hybrids of generative and discriminative models are some of the approaches that provide this guidance through cooperation. Furthermore, selecting features based on their discrimination ability, replacing the weakest classifier of an ensemble or the oldest one, and applying a budget on the sample pool (hence, keeping only some prototypical samples) [15, 43] have been proposed to improve the performance of such solutions.
On top of that, the frequency of the update plays another important role in the tracker’s performance. Higher update rates capture rapid target changes but are prone to occlusions, whereas slower update paces provide a long memory for the tracker to handle temporal target variations but lack the flexibility to accommodate permanent target changes. To this end, researchers try to combine long- and short-term memories, roll back improper updates, or utilize different temporal snapshots of the classifier to overcome the non-stationary distribution of the target’s appearance. This pipeline, however, was altered in some studies to introduce desired properties, e.g., to avoid label noise by merging the sampling and labeling steps.
Online visual tracking is the task of updating the state vector, which involves the location, size, and shape of the bounding box, at each observed video frame. The update process is sometimes written with a transformation that maps the previous state vector to the current state.
In the tracking-by-discrimination framework, we utilize a classifier that discriminates an image patch into either target or background. The classifier is denoted as a real-valued discriminant function, and the function value is called a discrimination score or, in short, a score. The patch (i.e., the area of the image bounded by the bounding box) is labeled as target if its score exceeds a threshold and as background otherwise. A typical procedure of tracking-by-discrimination is written as follows.
SAMPLING: We obtain samples of the state by drawing random transformations with a dense or sparse sampling strategy and applying each transformation to the previous state. The samples are thus defined by these transformations, and their corresponding image patches are extracted from the image.
CLASSIFYING: We calculate the score of the image patches corresponding to all samples, or bounding boxes, by evaluating the discriminant function of the current classifier on each patch.
LABELING: We determine the label of each sample using the score of the sample. If the score is above the threshold, the sample is likely to be a target match and is labeled as target; otherwise, it is labeled as background.
ESTIMATING: We determine the next target state, typically by selecting the best transformation, i.e., the one that corresponds to the maximum score, and applying it to the previous state.
UPDATING: Finally, we update the classifier with its own labeled data.
The update is performed by an update function (e.g., the budgeted SVM update) that takes the current classifier together with the set of input patches and their output labels, which serve as the training set of the discriminator.
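The five steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the chapter's implementation: the state is taken to be a tuple (x, y, width, height), the random transformations are assumed to be additive Gaussian offsets, and `score_fn` and `update_fn` are hypothetical callables standing in for the discriminant function and the update function. The default threshold of 0 and the spreads in `sigma` are illustrative values.

```python
import random


def track_step(prev_state, score_fn, update_fn, rng=None, n_samples=200,
               sigma=(4.0, 4.0, 1.0, 1.0), tau=0.0):
    """One pass of SAMPLING, CLASSIFYING, LABELING, ESTIMATING, UPDATING.

    prev_state: (x, y, w, h) of the previous bounding box (assumed layout).
    score_fn:   hypothetical callable, box -> real-valued score.
    update_fn:  hypothetical callable, (boxes, labels) -> None.
    """
    rng = rng or random.Random()
    # SAMPLING: apply random transformations (assumed: Gaussian offsets)
    # to the previous state.
    samples = [tuple(p + rng.gauss(0.0, s) for p, s in zip(prev_state, sigma))
               for _ in range(n_samples)]
    # CLASSIFYING: score every sample with the current classifier.
    scores = [score_fn(box) for box in samples]
    # LABELING: threshold the score to get target (+1) or background (-1).
    labels = [1 if s > tau else -1 for s in scores]
    # ESTIMATING: keep the sample with the maximum score.
    best = max(range(n_samples), key=scores.__getitem__)
    new_state = samples[best]
    # UPDATING: retrain the classifier on its own labeled data.
    update_fn(samples, labels)
    return new_state, samples, scores, labels
```

Because the classifier is retrained on labels it produced itself, this sketch also makes the self-learning loop discussed earlier explicit: any error in `labels` is fed straight back through `update_fn`.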
2.2. Baseline system implementation
To develop a baseline tracking-by-detection algorithm for this study, we use a robust part-based detector for the CLASSIFYING process. This detector employs strong low-level features based on histograms of oriented gradients (HOG) and uses a latent SVM to perform efficient matching for deformable part-based models (pictorial structures). From each frame, we draw samples from a Gaussian distribution whose mean is the target’s bounding box in the last frame (including its location and size). The selected detector then outputs the classification score for each sample, which is thresholded to obtain the sample’s label. The sample with the highest classification score is taken as the current target location (Figure 1).
In the first frame, we generate positive samples by perturbing the first annotated target patch by a few pixels in location and size, select negative samples from the local neighborhood of the target, and select further negative samples from the global background of the object in a regular grid. These samples are used to train the SVM detector in the first frame. In the subsequent frames, the labels are obtained by the detector itself, and the classifier is batch-trained with all of the samples collected so far.
There are several parameters in the system, such as the parameters of the sampling step (the number of samples and the effective search radius). These parameters were tuned using simulated annealing optimization on a cross-validation set. The part-based detector dictionary, the thresholds, and the rest of the abovementioned parameters were adjusted using cross-validation. With these settings, T1 achieved a speed of 47.29 fps with a Matlab/C++ implementation running on the CPU of a Pentium IV PC @ 3.5 GHz.
2.3. Method of evaluation
The experiments are conducted on the 100 challenging video sequences of OTB-100, which involve many visual tracking challenges: changes in target appearance, pose, and geometry; changes in environment lighting and camera position; target movement artifacts such as blur and trajectory variations; low imaging resolution and noise; and background objects that may cause occlusions, clutter, or target identity confusion. The performance of the trackers is compared using the area under the curve of the success plots and precision plots, on all of the sequences or on the subset of them with a given attribute.
The success plot indicates the reliability of the tracker and its overall performance, while the precision plot reflects the accuracy of the localization. A success is counted at a given time step when the overlap of the tracker’s target estimate with the ground truth exceeds a threshold. The success plot graphs the success of the tracker against different values of this threshold, and its area under the curve (AUC) accumulates the number of successes over time across the threshold values, normalized by the length of the sequence.
Here, the overlap is the ratio of the area of the intersection of the estimated and ground-truth regions to the area of their union, and each success is marked by a step function that returns 1 iff its argument is positive and 0 otherwise. This plot provides an overall performance of the tracker, reflecting target loss, scale mismatches, and localization accuracy.
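The success-plot computation described above can be illustrated with a short Python sketch. It assumes that boxes are given as (x, y, width, height) tuples and approximates the area under the curve with a midpoint rule over overlap thresholds in [0, 1]; the function names and the number of integration steps are illustrative choices, not part of the benchmark definition.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_curve(pred, truth, thresholds):
    """Fraction of frames whose overlap exceeds each threshold."""
    overlaps = [iou(p, g) for p, g in zip(pred, truth)]
    return [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]


def success_auc(pred, truth, n_steps=100):
    """Area under the success plot, midpoint rule on [0, 1]."""
    thresholds = [(k + 0.5) / n_steps for k in range(n_steps)]
    return sum(success_curve(pred, truth, thresholds)) / n_steps
```

A tracker that always matches the ground truth obtains an AUC of 1, and one that never overlaps it obtains 0; partial overlaps and lost frames both pull the value down, which is why the measure reflects target loss as well as localization accuracy.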
To establish a fair comparison with the state of the art of tracking-by-detection algorithms, TLD and STRUCK are selected based on the results of , BSBT and MIL are selected based on popularity, and CSK is selected as one of the latest algorithms in the category. Since our trackers contain random elements (in sampling and resampling), the results reported here are the average of five independent runs.
Figure 2 presents the success and precision plots of T1 along with other competitive trackers for all sequences. We also included a fixed version of the T1 tracker (a detector without model update) as T0 to emphasize the role of updating. The figure demonstrates that, without the model update, the detector cannot reflect the changes in target appearance and loses the target rapidly in most of the scenarios (comparing T0 and T1). However, it is also evident that a single tracker is not robust against all of the target’s variations (in line with ) and that the performance of T1 is still low.
A single detector may have difficulties in distinguishing the target from the background in certain scenarios. In those cases, it is beneficial to consult another detector with higher robustness. This second detector may have characteristics complementary to those of the first one or may simply be a more sophisticated detector that achieves its robustness at the cost of speed.
Collaborative discriminative trackers utilize classifiers that exchange their information to achieve more robust tracking. These information exchanges are in the form of queries that one classifier sends to another. The purpose of this information exchange is to bridge across long-term and short-term memories; accommodate multi-memory dictionaries and mixtures of deep and shallow models; facilitate multiple views of the data; and enable learning from mistakes.
Built on the co-training principle, collaborative tracking (co-tracking) provides a framework in which two classifiers exchange their information to promote tracking results and break the self-learning loop (Figure 3). In this two-classifier framework, the challenging samples for one classifier are labeled by the other one, i.e., if a classifier finds a sample difficult to label, it relies on the other classifier to label it for this frame and for similar samples in the future. In this case, we calculate the discrimination score as a weighted sum of the two discriminant functions, in which each discriminator has its own weight. At the CLASSIFYING step, a sample is considered challenging for a discriminator when its score under that discriminator lies close to the corresponding discrimination boundary. When one of the two discriminators finds the sample challenging, the score of the sample is calculated using the score of the other discriminator.
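The score fusion described in this paragraph can be sketched as follows. This is a hedged illustration rather than the chapter's exact equation: the decision boundary is assumed to sit at a score of 0, the `margin` that defines a challenging sample is a hypothetical parameter, and taking the other discriminator's raw (unweighted) score is an assumption about how the hand-over is computed.

```python
def co_tracking_score(h1, h2, w1, w2, margin=0.5):
    """Fuse the scores of two discriminators for one sample.

    h1, h2: scores of the two discriminators (boundary assumed at 0).
    w1, w2: weights of the two discriminators.
    margin: hypothetical half-width of the "challenging" band.
    """
    hard_for_1 = abs(h1) < margin
    hard_for_2 = abs(h2) < margin
    if hard_for_1 and not hard_for_2:
        # Discriminator 1 is unsure, so discriminator 2 decides.
        return h2
    if hard_for_2 and not hard_for_1:
        # Discriminator 2 is unsure, so discriminator 1 decides.
        return h1
    # Both confident or both unsure: fall back to the weighted sum.
    return w1 * h1 + w2 * h2
```

For example, `co_tracking_score(0.1, 2.0, 0.5, 0.5)` hands the decision to the second discriminator, whereas two confident scores are combined linearly.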
At the UPDATING step, the weight of each discriminator is adjusted according to the degree to which it contradicts the provisional answers, which are determined at the ESTIMATING step by integrating all the information. Finally, the classifiers are updated using only the samples that they successfully labeled in the previous frame to reflect the latest target changes.
For this experiment, we selected a naive classifier with properties complementary to those of the main classifier in the previous section. This classifier is a KNN classifier using color-histogram (HOC) and HOG features, trained on the samples drawn from the first frame and updated with all the samples labeled through the collaboration of the classifiers. Not being pre-trained, this auxiliary classifier performs poorly in the beginning but gradually improves. The quick classification of the KNN (owing to its kd-tree implementation and lightweight features) and the lack of pre-training grant it high speed and generalization, in contrast to the main detector. However, it should be noted that, without being supervised by the main SVM-based detector, this classifier cannot perform well in isolation for the tracking task. Figure 5 presents the performance of this auxiliary tracker as T2. As observed in the figure, the performance of the obtained co-tracker (T3) is better than that of the main detector (T1) and the auxiliary classifier (T2) as a result of co-labeling, data exchange, and co-learning.
4. Active co-tracking
The co-tracking framework provides a means for classifiers to exchange information. This framework utilizes a utility measure (e.g., the classification confidence in ) to select the data that one of the collaborators fails to classify with high confidence and then trains the other classifier on those data. This approach has two main shortcomings: (1) the redundant labeling of all samples by both classifiers and (2) training the collaborator with “all” of the uncertain samples. While the former increases the complexity of the system, the latter is not the optimal solution for tracking a target with non-stationary appearance distributions.
In this view, a principled ordering of samples for training and selecting a subset of them based on criteria can reduce the cost of labeling, leading to a faster performance increase as a function of the amount of data available. It is found that detectors trained with an effective, noise-free, and outlier-free subset of the training data may achieve higher performance than those trained with the full set [71, 72].
Robust learning algorithms provide an alternative way of differentially treating training examples, by assigning different weights to different training examples or by learning to ignore outliers. Learning first from easy examples, pruning adversarial examples, and sorting the samples based on their training value are some of the approaches explored in the literature. However, the most common setting is active learning, whereby most of the data is unlabeled and an algorithm selects which training examples to label at each step for the highest gains in performance. Some active learning approaches focus on learning the hardest examples first (e.g., those closest to the decision boundary), whereas others gauge the information contained in each sample and select the most informative ones first. For example, Lewis and Gale utilized the uncertainty of the classifier for a sample as an index of its usefulness for training.
4.1. The idea
Active learning has been used in visual tracking to consider the uncertainty caused by bags of samples, to reduce the number of necessary labeled samples, to unify the sample learning and feature selection procedures, and to reduce the sampling bias by controlling the variance.
In this study, we utilize the sampling uncertainty, which can bind active learning and co-tracking together. As mentioned earlier, the baseline classifier, despite being accurate, has low generalization on new samples, slow classification speed, and computationally expensive retraining. On the other hand, the auxiliary classifier is agile and learns rapidly, with negligible retraining time. To combine the merits of these two classifiers, to cancel out their demerits with one another, and to address the aforementioned issues of co-tracking (redundant labeling and excessive samples), we incorporate an active learning module to select the most informative data, i.e., those for which the naive classifier is most uncertain, and query their labels from the part-based detector. This architecture (Figure 4, here called T4) mainly uses the naive classifier for labeling the data and only asks the slower detector for the labels of hard samples; it therefore limits the redundancy and unleashes the speed of the agile classifier. In addition, by training the naive classifier only on hard samples, the generalization of this classifier is preserved while its accuracy increases.
To further increase the accuracy of the tracker and make it more robust against occlusions and drastic temporal changes of the target, it is possible to update the detector less frequently. This asymmetric version of the active co-tracker (T5), by introducing long-term memory to the tracker, benefits from combining the long- and short-term collaboration (as in ) and reduces the frequency of the expensive updates of the tracker (Algorithm 1).
Algorithm 1: Active co-tracking (ACT)
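Input: Target position in the last frame
Output: Target position in the current frame
for each sample index up to the number of samples do
    Generate a sample by transforming the last target position
Determine the uncertain samples (Eq. (7))
if a sample belongs to the uncertain set then (the sample is uncertain)
    Label it using the other classifier
Update the classifier with the stored samples every fixed number of frames (for T4)
if there are enough positive samples and their average score is high enough then
    Approximate the target state (Eq. (9))
else the target is deemed occluded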
In the proposed active co-tracking framework, a main classifier attempts to label the sample and queries the label from the other classifier if its own result is uncertain. This is in contrast with the linear combination of both classifiers based on their classification accuracy, as adopted in T3. At the CLASSIFYING step, the proposed tracker scores each sample based on the classifier confidence, i.e., a score is calculated for every sample.
Based on uncertainty sampling, the samples for which the classification score is more uncertain (i.e., closer to zero) contain more information for the classifier if they are labeled by the other classifier. Therefore, the scores of all samples are sorted, and a fixed number of samples with the values closest to 0 are selected to be queried from the other classifier. To handle the situations in which the number of highly uncertain samples exceeds this fixed number, a range of scores is delimited by a lower and a higher threshold, and all the samples in this range are also considered highly uncertain.
Together, these samples form the list of uncertain samples (Eq. (7)). The label of each sample is then determined accordingly: uncertain samples are labeled by the queried classifier, and the remaining samples are labeled by the main classifier,
and all image patches and labels are stored for the update.
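The selection and labeling rule described above can be sketched as follows. The sketch assumes that the decision boundary of the main classifier is at a score of 0, that `m` is the fixed number of queries, and that `tau_low` and `tau_high` are the lower and higher thresholds of the uncertain range; `query_fn` is a hypothetical callable standing in for the queried classifier. All of these names are illustrative.

```python
def select_uncertain(scores, m, tau_low, tau_high):
    """Indices of the samples to be queried from the other classifier."""
    # The m samples whose scores are closest to zero ...
    by_closeness = sorted(range(len(scores)), key=lambda j: abs(scores[j]))
    uncertain = set(by_closeness[:m])
    # ... plus every sample whose score falls inside the uncertain range.
    uncertain.update(j for j, s in enumerate(scores) if tau_low < s < tau_high)
    return uncertain


def label_samples(samples, scores, uncertain, query_fn):
    """Label uncertain samples by query, the rest by the main score."""
    labels = []
    for j, (x, s) in enumerate(zip(samples, scores)):
        if j in uncertain:
            # Hard sample: ask the other classifier for its opinion.
            labels.append(1 if query_fn(x) > 0 else -1)
        else:
            # Easy sample: trust the sign of the main classifier's score.
            labels.append(1 if s > 0 else -1)
    return labels
```

Only the samples returned by `select_uncertain` reach `query_fn`, which is what keeps the slower classifier out of the labeling of easy samples.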
At the ESTIMATING step, we follow the importance sampling mechanism originally employed by particle filter trackers (Eq. (9)),
in which indicator functions (1 if true, zero otherwise) select the samples that contribute to the estimate. This mechanism approximates the state of the target based on the effect of the positive samples, in which samples with higher scores pull the final result more strongly toward themselves. Upon events such as massive occlusion or target loss, this sampling mechanism degenerates. In such cases, the number of positive samples and their corresponding weights shrink significantly, and the importance sampling becomes prone to outliers, distractors, and occluded patches. To address this issue, if the number of positive samples is less than a threshold and their average score is less than another threshold, the target is deemed occluded to avoid tracker degeneracy.
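A minimal sketch of this estimation step is given below, assuming that the state is a tuple of numbers and that the estimate is the score-weighted average of the positive samples. The threshold values, the function name, and the choice to also report occlusion when no positive sample exists are illustrative assumptions; only samples with a positive label and a positive score are used as weights here so that the weights stay positive.

```python
def estimate_state(samples, scores, labels, min_positives=5, min_mean_score=0.5):
    """Score-weighted average of positive samples, or None if occluded."""
    positives = [(x, s) for x, s, l in zip(samples, scores, labels)
                 if l > 0 and s > 0]
    if not positives:
        # No usable positive sample: treat the target as occluded.
        return None
    total = sum(s for _, s in positives)
    mean_score = total / len(positives)
    # Occlusion test as described in the text: few positives AND low scores.
    if len(positives) < min_positives and mean_score < min_mean_score:
        return None
    dim = len(positives[0][0])
    # Samples with higher scores pull the estimate toward themselves.
    return tuple(sum(s * x[d] for x, s in positives) / total
                 for d in range(dim))
```

Returning `None` lets the caller keep the previous state while the target is deemed occluded, instead of averaging over a handful of unreliable samples.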
Figure 5 illustrates the effectiveness of the proposed trackers against their baselines. The active query mechanism in T4 improves the efficiency and effectiveness of co-tracking (T3). In the asymmetric co-tracker (T5) in particular, the mixture of long-term and short-term memory classifiers is the key to automatically balancing stability and plasticity. It is also prudent for the tracker to adapt to the temporal distribution of the target appearance, before its redistribution by illumination changes, etc.
In summary, the advantages of the proposed trackers, especially the asymmetric one (T5), compared to conventional co-tracking (T3) are as follows: (1) the classifiers do not exchange all the data they have problems labeling; instead, the most informative samples are selected by uncertainty sampling and exchanged; (2) the update rates of the classifiers differ, to realize a mixture of short- and long-term memory; (3) the samples that are labeled for the target localization can be reused for training, so the need for an extra round of sampling and labeling is eliminated; and (4) since, in the proposed asymmetric co-tracking, one of the classifiers scaffolds the other one instead of participating in every labeling process, a more sophisticated classifier with higher computational complexity can be used.
5. Active ensemble co-tracking
Ensemble discriminative tracking utilizes a committee of classifiers, to label data samples, which are in turn used for retraining the tracker to localize the target using the collective knowledge of the committee. In such frameworks the labeling process is performed by leveraging a group of classifiers with different views [45, 56, 80], subsets of training data [57, 81], or memories [57, 82].
In ensemble tracking [45, 47, 56, 57, 60, 83, 84, 85], the self-learning loop is broken, and the labeling process is performed by eliciting the belief of a group of classifiers. However, this framework typically does not address some of the demands of tracking-by-detection approaches, such as a proper model update to avoid model drift or the non-stationarity of the target sample distribution. Besides, ensemble classifiers do not exchange information, whereas collaborative classifiers entirely trust the other classifier to label the challenging samples for them and are therefore susceptible to label noise.
Traditionally, ensemble trackers were used to provide a multi-view classification of the target, realized by using different features to construct weak classifiers. In this view, different classifiers represent different hypotheses in the version space, to accurately model the target appearance. Such hypotheses are highly overlapping; therefore, an ensemble of them overfits the target. The desired committee, however, consists of competing hypotheses, all consistent with the training data, but each of them specialized in a certain aspect. In this view, the most informative data samples are those about which the hypotheses disagree the most, and by labeling them, the version space is minimized, leading to quick convergence yet accurate classification. Motivated by this, we propose a tracker that employs a randomized ensemble of classifiers and selects the most informative data samples to be labeled.
5.1. The idea
To create ensembles of classifiers, researchers typically make different classifiers by altering the features, using a pool of appearance and dynamics models, utilizing different memory horizons, and employing previous snapshots of a classifier at different times. However, creating a collaborative mechanism in the ensemble, in which classifiers exchange information, is hardly addressed in the visual tracking literature. This data exchange can be in the form of query passing between ensemble members, in which the queries can be the samples for which a classifier is uncertain or even those for which the ensemble is most uncertain.
Selecting such queries is addressed in different machine learning domains such as curriculum learning and active learning. The Query-by-Committee (QBC) algorithm [86, 88] is an active learning approach for ensembles that selects the most informative query to pass within a committee of models, which are all trained on the current labeled set but represent competing hypotheses. The label of the queried sample is then decided by the vote of the ensemble members, and the samples on which the ensemble's opinions are most diverse are selected as the next query to ask from the teacher (here, the auxiliary classifier). In this case, where the task is a binary classification, the most disputed sample (i.e., the one with close positive and negative votes) is the most informative, since learning its label would maximally train the ensemble. Training with the external label for this sample shrinks the version space (i.e., the space of all hypotheses consistent with the training data) such that it remains consistent with the hypotheses of all classifiers but rejects more potentially incorrect ones.
QBC was originally designed to work with stochastic learning algorithms, which limits its use with non-probabilistic or deterministic models. To alleviate this problem, Abe and Mamitsuka let deterministic classifiers train on random subsets of the training data to create different variations of the same learning model. By creating a temporary ensemble with this “bagging” procedure, they realized Query-by-Bagging (QBag), which enhances the learning speed and generalization of the base learning algorithm.
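A minimal sketch of the bagging idea follows. It assumes a deterministic 1-nearest-neighbour rule on scalar features as the base learner and bootstrap resampling with replacement to induce the committee; all names and the toy data are hypothetical, and the sketch shows the general QBag principle rather than the tracker's actual code.

```python
import random

def train_1nn(samples):
    """Deterministic base learner: 1-nearest-neighbour over (feature, label) pairs."""
    def predict(x):
        return min(samples, key=lambda s: abs(s[0] - x))[1]
    return predict

def bagged_committee(labeled, n_members, rng):
    """Train each member on a bootstrap resample (with replacement) of the labeled set."""
    committee = []
    for _ in range(n_members):
        subset = [rng.choice(labeled) for _ in labeled]
        committee.append(train_1nn(subset))
    return committee

def vote_margin(committee, x):
    """Absolute difference between positive and negative votes for sample x."""
    return abs(sum(member(x) for member in committee))

def query_by_bagging(committee, pool):
    """Return the pool sample whose positive and negative votes are most balanced."""
    return min(pool, key=lambda x: vote_margin(committee, x))

# Toy data (assumed): scalar features, -1 for background and +1 for target.
labeled = [(0.0, -1), (0.1, -1), (0.2, -1), (0.8, 1), (0.9, 1), (1.0, 1)]
rng = random.Random(0)
committee = bagged_committee(labeled, n_members=7, rng=rng)
pool = [0.05, 0.5, 0.95]
query = query_by_bagging(committee, pool)
```

Although every member uses the same deterministic learning rule, the resampled training sets make the members disagree near the decision boundary, which is exactly where a query is most useful.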
We propose adapting the QBag algorithm to online training to solve the label noise problem in T6. Similar to T5, the drift problem is handled using a dual-memory strategy: the committee rapidly adapts to target changes, whereas the main classifier possesses a longer memory to promote the stability of the target template (Figure 6).
An ensemble discriminative tracker employs a set of classifiers instead of one. These classifiers, hereafter called the committee, are represented by and are typically homogeneous and independent (e.g., [56, 85]). Popular ensemble trackers utilize the majority vote of the committee as their utility function:
Eq. (8) is then used to label the samples. Finally, the model of each classifier is updated independently, meaning that each committee member is trained with a random subset of the uncertain set. where is the updating the model with samples . The uncertain set contains all of the samples on which the ensemble disagrees and which were therefore sent to the auxiliary classifier for labeling. The detector is also updated with all recent data every frames.
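One labeling-and-update step of this kind can be sketched as follows. The sketch makes several assumptions that are not specified in the text: a sample counts as disputed whenever the committee vote is not unanimous, each committee member is a toy 1-nearest-neighbour classifier with a bounded (short) memory, and each member is updated with a random half of the uncertain set. The class, function, and parameter names are hypothetical.

```python
import random

class MemoryClassifier:
    """Toy 1-NN committee member with bounded memory (hypothetical stand-in)."""
    def __init__(self, samples, capacity=20):
        self.samples = list(samples)
        self.capacity = capacity

    def predict(self, x):
        return min(self.samples, key=lambda s: abs(s[0] - x))[1]

    def update(self, new_samples):
        # Keep only the most recent samples to emulate a short memory.
        self.samples = (self.samples + list(new_samples))[-self.capacity:]

def committee_step(committee, auxiliary, pool, rng, subset_ratio=0.5):
    """Label the pool, query the auxiliary classifier on disputes, update members."""
    labels, uncertain = {}, []
    for x in pool:
        votes = [member.predict(x) for member in committee]
        if all(v == votes[0] for v in votes):
            labels[x] = votes[0]          # unanimous: keep the committee label
        else:
            labels[x] = auxiliary(x)      # disputed: ask the auxiliary classifier
            uncertain.append((x, labels[x]))
    if uncertain:
        k = max(1, int(subset_ratio * len(uncertain)))
        for member in committee:
            # Each member sees its own random subset of the uncertain set.
            member.update(rng.sample(uncertain, k))
    return labels, uncertain

# Toy setup (assumed): three members with different decision boundaries.
committee = [
    MemoryClassifier([(0.0, -1), (1.0, 1)]),
    MemoryClassifier([(0.0, -1), (0.6, 1)]),
    MemoryClassifier([(0.4, -1), (1.0, 1)]),
]
auxiliary = lambda x: 1 if x >= 0.5 else -1   # stand-in for the long-memory detector
labels, uncertain = committee_step(committee, auxiliary,
                                   [0.1, 0.45, 0.6, 0.9], random.Random(0))
```

In this toy run only the two samples near the decision boundary are disputed, so only they are sent to the auxiliary classifier, and every member then absorbs one of them. As the members converge, fewer samples are disputed and fewer queries are issued.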
Figure 7 depicts the overall performance of the proposed tracker against the other benchmarked algorithms on all sequences of the dataset. The plots show that T6 outperforms T5 and its predecessors. The steep slope between indicates the high quality of the predictions (i.e., more predictions have a high overlap with the ground truth, rather than being partially correct), and the other slope around along with high success rate near indicates that the algorithm succeeded in continuing to track the target despite all the tracking challenges.
The instances of the proposed framework are evaluated against state-of-the-art trackers on public sequences that have become the de facto standard for benchmarking trackers. The trackers are compared using popular metrics such as the success plot and the precision plot to establish a fair benchmark. In addition, the performance of the proposed trackers is investigated on videos with a distinguished tracking challenge, and the results are compared with the state of the art and discussed. The effect of the exchanged information is also examined thoroughly to illustrate the dynamics of the system. The preliminary results demonstrate a superior performance for the proposed trackers on all the sequences and on most of the subsets of the test dataset with distinguished challenges. Finally, future research directions are discussed, and the research avenues opened by this work are introduced.
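For readers unfamiliar with these metrics, the following sketch shows how success and precision curves are commonly computed. It assumes boxes given as (x, y, width, height), a success criterion of overlap strictly greater than the threshold, and a precision criterion of center error no greater than the threshold; the function names and toy boxes are hypothetical, and the exact conventions of the benchmark used in this chapter may differ.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    """Euclidean distance between the centers of two boxes."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

def success_curve(pred, truth, thresholds):
    """Fraction of frames whose overlap exceeds each overlap threshold."""
    overlaps = [iou(p, t) for p, t in zip(pred, truth)]
    return [sum(o > th for o in overlaps) / len(overlaps) for th in thresholds]

def precision_curve(pred, truth, thresholds):
    """Fraction of frames whose center error is within each pixel threshold."""
    errors = [center_error(p, t) for p, t in zip(pred, truth)]
    return [sum(e <= th for e in errors) / len(errors) for th in thresholds]

# Toy sequence (assumed): a perfect frame, a half-shifted frame, and a lost frame.
truth = [(0, 0, 10, 10)] * 3
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
success = success_curve(pred, truth, [0.0, 0.5, 0.9])
precision = precision_curve(pred, truth, [1, 10, 30])
```

A success curve that stays flat over low thresholds and drops only at high ones indicates that most predictions overlap the ground truth tightly, whereas the precision curve ignores scale and measures only how close the predicted center is.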
As Figure 7 and Table 2 demonstrate, T6 has the best overall performance among the investigated trackers on this dataset. While this algorithm has a clear edge in handling many challenges, its performance is comparable with that of T5 in the case of occlusions and z-rotations. It is also evident that T6 struggles with fast deformations, since none of the ensemble members is specialized in handling a specific type of deformation, and the collective decision of the ensemble may involve mistakes made with high confidence. T5, on the other hand, utilizes a dual-memory scheme, and its single classifier can handle extreme temporal deformations better than the ensemble in T6. Interestingly, in most of the subcategories in which T6 is clearly better than the other trackers, the success plot of T6 starts with a plateau and later has a sharp drop around . This means that T6 provides high-quality localization (i.e., bigger overlaps with the ground truth). Similarly, the precision plot shows that T6 degrades gracefully in different scenarios, and although it does not provide good scale adaptation for targets, it is able to localize them better than the competing trackers (Figure 8).
7. Conclusions and future works
This chapter provides a step-by-step tutorial for creating an accurate and high-performance tracking-by-detection algorithm out of ordinary detectors by eliciting an effective collaboration among them. The use of active learning in conjunction with co-learning enables the creation of a battery of trackers that strive to minimize the uncertainty of one classifier with the help of another. The progressive design leads to the use of a committee of classifiers that employ online bagging to keep up with the latest target appearance changes while improving the accuracy and generalization of the base tracker (a feature-based KNN). Inspired by the query-by-bagging algorithm, this algorithm selects the most informative samples to learn from the long-term memory auxiliary detector. This realizes a gradually decreasing dependence on this detector, which is slow and likely to overfit, yet robust against fluctuations in target appearance and occlusions. Furthermore, using an expectation of the bounding boxes compensates for the overreliance of the tracker on the classifiers’ confidence function. The stability-plasticity balance is achieved by combining several short-term classifiers with a long-term classifier and managing their interaction with an active learning mechanism.
The trail of proposed trackers led to T6, which incorporates ensemble tracking, active learning, and co-learning in a discriminative tracking framework and outperforms state-of-the-art discriminative and generative trackers on a large video dataset with various types of challenges such as appearance changes and occlusions.
The future direction of this study involves employing other detectors to account for context, building accurate physical models for known categories, using deep features to improve discrimination, and examining different methods of building the ensemble, detecting the most informative samples, and exchanging information.
This article is based on results obtained from a project commissioned by the Japan NEDO and was supported by Post-K application development for exploratory challenges from the Japan MEXT.
- Images with tiny, imperceptible perturbations that fool a classifier into predicting the wrong labels with high confidence