Abstract:
Object detection methods draw increasing attention in deep learning-based visual tracking algorithms due to their robust discrimination and powerful regression ability. To further explore the potential of object detection methods in the visual tracking task, two gaps need to be bridged. The first is the difference in object definition: object detection is class-specific, while visual tracking is class-agnostic; moreover, visual tracking must differentiate the target from intra-class distractors. The second is the difference in the temporal dimension: unlike object detection, which processes still images, visual tracking concentrates on objects that vary continuously over time. In this paper, we propose a Detection to Tracking (D2T) framework to address the above issues and effectively transfer existing advanced detection methods to the visual tracking task. Specifically, to bridge the gap in object definition, we propose a general-to-specific network that separates the learning of general object features from the learning of instance-level features. To make full use of contextual information while adapting to the appearance variation of targets, we propose a temporal strategy combining a short-term constraint with long-term updating. To the best of our knowledge, D2T is the first universal framework that directly transfers deep learning-based object detectors to the visual tracking task. It provides a novel solution to visual object tracking and achieves superior performance on several public datasets. (c) 2022 Elsevier Ltd. All rights reserved.
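The abstract names the two ingredients of the temporal strategy (a short-term constraint and long-term updating) but gives no formulas. A minimal sketch of one common reading, assuming a Gaussian motion prior over the response map and an exponential moving average for the template; both forms and all names here are illustrative, not the paper's definitions:

```python
import numpy as np

def short_term_constraint(score_map, prev_pos, sigma=10.0):
    # Down-weight candidates far from the previous target position
    # with a Gaussian window (assumed form of the constraint).
    h, w = score_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (ys - prev_pos[0]) ** 2 + (xs - prev_pos[1]) ** 2
    return score_map * np.exp(-dist2 / (2.0 * sigma ** 2))

def long_term_update(template, new_feature, rate=0.01):
    # Slowly blend new instance-level features into the template
    # (assumed exponential-moving-average update).
    return (1.0 - rate) * template + rate * new_feature
```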
Abstract:
With the increasing popularity of Unmanned Aerial Vehicles (UAVs) in computer vision-related applications, intelligent UAV video analysis has recently attracted the attention of an increasing number of researchers. To facilitate research in the UAV field, this paper presents a UAV dataset with 100 videos featuring approximately 2700 vehicles recorded under unconstrained conditions and 840k manually annotated bounding boxes. These UAV videos were recorded in complex real-world scenarios and pose significant new challenges to existing object detection and tracking methods, such as complex scenes, high density, small objects, and large camera motion. These challenges encouraged us to define a benchmark for three fundamental computer vision tasks, namely object detection, single object tracking (SOT), and multiple object tracking (MOT), on our UAV dataset. Specifically, our UAV benchmark facilitates evaluation and detailed analysis of state-of-the-art detection and tracking methods on the proposed UAV dataset. Furthermore, we propose a novel approach based on the so-called Context-aware Multi-task Siamese Network (CMSN) model, which explores new cues in UAV videos by judging the degree of consistency between objects and their contexts, and which can be used for both SOT and MOT. The experimental results demonstrate that our model makes tracking results more robust in both SOT and MOT, showing that current tracking and detection methods have limitations in dealing with the proposed UAV benchmark and that further research is indeed needed.
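The abstract does not specify how CMSN measures the "degree of consistency" between objects and contexts. One illustrative reading, assuming object and context embeddings are available, scores a candidate by the cosine similarity of the two; the real model learns this end to end inside a multi-task Siamese network:

```python
import numpy as np

def context_consistency(obj_feat, ctx_feat):
    # Cosine similarity between an object embedding and the
    # embedding of its surrounding context; a low score flags a
    # candidate that does not fit its context. Illustrative only.
    num = float(np.dot(obj_feat, ctx_feat))
    den = np.linalg.norm(obj_feat) * np.linalg.norm(ctx_feat) + 1e-12
    return num / den
```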
Abstract:
To exploit the reliable prior knowledge that the target object in tracking must be an object rather than a non-object, in this letter we propose to adapt objectness for visual object tracking. Instead of directly applying an existing objectness measure, which is generic and handles various objects and environments, we adapt it to be compatible with the specific tracking sequence and object. More specifically, we use the newly proposed binarized normed gradient (BING) objectness as the base and then train an object-adaptive objectness for each tracking task. The training is implemented using an adaptive support vector machine that integrates information from the specific tracking target into the BING measure. We emphasize that the benefit of the proposed adaptive objectness, named ADOBING, is generic. To show this, we combine ADOBING with eight top-performing trackers from recent evaluations. We run the ADOBING-enhanced trackers along with their base trackers on the CVPR2013 benchmark, and our methods consistently improve the base trackers both in overall performance and under all challenge factors. Since the way we integrate objectness into visual tracking is generic and straightforward, we expect even more improvement from tracker-specific objectness.
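To make the adaptation step concrete, here is a hedged sketch of the core idea, assuming per-window features and labels sampled from the tracked target and its background: a linear SVM learns a sequence-specific correction that is added to the generic BING score. The feature choice, loss, and weighting in ADOBING itself may differ:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_adaptive_objectness(feats, labels, C=1.0):
    # feats/labels: window features with positives from the tracked
    # target and negatives from its surroundings (assumed sampling).
    svm = LinearSVC(C=C).fit(feats, labels)

    def score(bing_score, feat):
        # Generic BING objectness plus the target-adaptive term.
        return bing_score + svm.decision_function(feat.reshape(1, -1))[0]

    return score
```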
Abstract:
The integration of visual data obtained from unmanned aerial vehicles (UAVs) has ushered in a new era of computer vision, greatly expanding the possibilities for object tracking applications. Nevertheless, existing UAV datasets predominantly focus on large-scale objects with distinct contours, overlooking the single tiny objects encountered in real-world flight scenarios. Extracting appearance information from these diminutive objects poses a considerable challenge for object tracking. To rectify this imbalance in data distribution, we propose a UAV dataset called Overhead Look Of Drones (OLOD), encompassing 70 sequences meticulously designed for tiny object tracking. It contains over 55k frames and provides supplementary information about altitude and flight attitude. Additionally, we incorporate 11 challenging attributes to enhance the complexity of the scenes, thereby establishing a comprehensive benchmark for single object tracking. OLOD serves as a valuable tool for evaluating the tracking capabilities of various algorithms on tiny objects. Through experimental results, we shed light on the limitations of existing methods for tracking tiny objects on this benchmark, underscoring the necessity for further research in this field. Our dataset and evaluation code will be released at https://github.com/yuymf/OLOD.
Abstract:
In this study, the authors propose two kinds of improvements to a baseline tracker that employs the tracking-by-detection framework. First, they explore different feature spaces by employing features commonly used in object detection to improve the performance of the detector. Second, they propose a robust scale estimation algorithm that estimates the size of the object in the current frame. Their experimental results on the challenging online tracking benchmark-13 (OTB-13) dataset show that reduced-dimensionality histogram of oriented gradients (HOG) features boost the performance of the tracker. The proposed scale estimation algorithm provides a significant gain and reduces tracker failures in challenging scenarios. The improved tracker is compared with 13 state-of-the-art trackers. The quantitative and qualitative results show that the performance of the tracker is comparable with the state of the art against initialisation errors, variations in illumination, scale and motion, out-of-plane and in-plane rotations, deformations and low resolution.
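The abstract does not detail the scale estimation algorithm; a minimal sketch of the standard strategy it most resembles evaluates the detector over a small pyramid of candidate sizes around the previous one and keeps the best. `response_fn` is a hypothetical callable returning the detector's peak response for a candidate size:

```python
import numpy as np

def estimate_scale(response_fn, prev_size, scales=(0.95, 1.0, 1.05)):
    # Score each candidate size and keep the one with the highest
    # detector response (assumed selection rule).
    w0, h0 = prev_size
    best_scale, best_score = 1.0, -np.inf
    for s in scales:
        score = response_fn(int(round(w0 * s)), int(round(h0 * s)))
        if score > best_score:
            best_scale, best_score = s, score
    return w0 * best_scale, h0 * best_scale
```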
Abstract:
A cyber-physical system (CPS) is a complex system combining computation, networking and physical processes; object tracking is an important application of CPS. To solve the problem that the traditional kernel correlation filter tracking algorithm cannot recover a lost object, the authors propose a re-detection object tracking algorithm. The proposed algorithm introduces a new adaptive detection criterion: by comparing the value of this criterion against an empirical threshold, it judges whether the current target has been lost. When tracking fails, the proposed method generates target candidate boxes using the EdgeBoxes algorithm and selects the best target location by applying non-maximum suppression and the Euclidean metric. In addition, a fast multi-scale estimation method and an adaptive updating method are added to the tracking procedure to further improve the overall performance of the algorithm. Experimental results show that the proposed approach performs well in terms of precision and success rates.
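The abstract describes the re-detection logic but not its equations. A sketch under stated assumptions: a peak-sharpness measure stands in for the paper's adaptive detection criterion, and after EdgeBoxes proposals have been filtered by non-maximum suppression, the surviving box nearest to the last known position is chosen by Euclidean distance:

```python
import numpy as np

def is_lost(response_map, threshold):
    # Stand-in confidence: how sharply the peak rises above the rest
    # of the response; tracking is declared lost below the threshold.
    ratio = (response_map.max() - response_map.mean()) / (response_map.std() + 1e-12)
    return ratio < threshold

def pick_redetection(boxes, last_center):
    # boxes: NMS-filtered EdgeBoxes proposals as (x1, y1, x2, y2) rows.
    centers_x = (boxes[:, 0] + boxes[:, 2]) / 2.0
    centers_y = (boxes[:, 1] + boxes[:, 3]) / 2.0
    d = np.hypot(centers_x - last_center[0], centers_y - last_center[1])
    return boxes[int(np.argmin(d))]
```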
Abstract:
In the past, there has been significant research on tracking objects based on features that characterize them. However, once the features needed to detect the object are no longer visible from the tracker's position, it is not clear how to continue tracking. Our goal in this paper is to track an object even when it leaves the field of view. We first describe the problems related to this type of tracking. We then explain our approach, which has two phases: real object tracking and virtual object tracking. We mostly focus on virtual object tracking, which requires reorientation, distance estimation, traveling the computed distance, estimating the direction of the turn, and turning in that direction. We demonstrate our results by building a robot that achieves these tasks.
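The five steps of virtual object tracking read as a fixed sequence, sketched below against a hypothetical robot interface (the abstract does not describe the authors' actual API or controller):

```python
class Robot:
    # Hypothetical motion primitives standing in for the real robot.
    def reorient(self, heading): print("reorient to", heading)
    def estimate_distance(self, pose): return pose["distance"]
    def travel(self, d): print("travel", d)
    def estimate_turn_direction(self, pose): return pose["turn"]
    def turn(self, direction): print("turn", direction)

def virtual_track(robot, last_seen_pose):
    # The phases named in the abstract, executed in order.
    robot.reorient(last_seen_pose["heading"])      # reorientation
    d = robot.estimate_distance(last_seen_pose)    # distance estimation
    robot.travel(d)                                # travel computed distance
    robot.turn(robot.estimate_turn_direction(last_seen_pose))

virtual_track(Robot(), {"heading": 90, "distance": 1.5, "turn": "left"})
```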
Abstract:
Video monitoring is an important means of ship traffic supervision. In practice, regulators often need to use an electronic chart platform to determine basic information about ships passing through a video feed. To enrich the information in the surveillance video and to effectively use multimodal maritime data, this paper proposes a novel ship multi-object tracking technology based on an improved single shot multibox detector (SSD) and the DeepSORT algorithm. In addition, a night contrast enhancement algorithm is used to improve ship identification performance in night scenes, and a multimodal data fusion algorithm is used to incorporate ship automatic identification system (AIS) information into the video display. The experimental results indicate that the ship information tracking accuracies in day and night scenes are 78.2% and 70.4%, respectively. Our method can effectively help regulators quickly obtain ship information from a video feed and improve the supervision of a waterway.
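Per frame, the described pipeline composes four stages; the sketch below wires them together with hypothetical stand-ins for each component (the paper's improved SSD, DeepSORT tracker, enhancement, and AIS fusion are not reproduced here):

```python
def track_ships(frame, detector, tracker, enhance_night, fuse_ais,
                night=False, ais_records=None):
    # Optional night contrast enhancement before detection.
    if night:
        frame = enhance_night(frame)
    detections = detector(frame)          # improved-SSD ship boxes
    tracks = tracker.update(detections)   # DeepSORT identity assignment
    if ais_records is not None:
        tracks = fuse_ais(tracks, ais_records)  # attach AIS information
    return tracks
```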
Abstract:
Recently, the correlation filter (CF) and the Siamese network have become the two most popular frameworks in object tracking. Existing CF trackers, however, are limited in feature learning and context usage, making them sensitive to boundary effects. In contrast, Siamese trackers easily suffer from the interference of semantic distractors. To address these problems, we propose an end-to-end target-insight correlation network (TICNet) for object tracking, which aims to break the above limitations on top of a unified network. TICNet is an asymmetric dual-branch network involving a target-background awareness model (TBAM), a spatial-channel attention network (SCAN), and a distractor-aware filter (DAF) for end-to-end learning. Specifically, TBAM distinguishes the target from the background at the pixel level, yielding a target likelihood map based on color statistics that is used to mine distractors for DAF learning. SCAN consists of a basic convolutional network, a channel-attention network, and a spatial-attention network, and generates attentive weights to enhance the representation learning of the tracker. In particular, we formulate a differentiable DAF and employ it as a learnable layer in the network, which helps suppress distracting regions in the background. During testing, DAF, together with TBAM, yields a response map for the final target estimation. Extensive experiments on seven benchmarks demonstrate that TICNet outperforms state-of-the-art methods while running at real-time speed.
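TBAM's color-statistics likelihood map is not specified beyond the abstract; the standard discriminative color model it evokes scores each pixel by the ratio of foreground to total counts in its color bin. A sketch under that assumption (TICNet's exact TBAM formulation may differ):

```python
import numpy as np

def target_likelihood(frame_bins, fg_hist, bg_hist, eps=1e-6):
    # frame_bins: H x W integer color-bin index per pixel.
    # fg_hist/bg_hist: bin counts from the target region and its
    # surroundings. Returns an H x W target likelihood map.
    ratio = fg_hist / (fg_hist + bg_hist + eps)
    return ratio[frame_bins]
```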
Abstract:
In this work, we study how to exploit a natural language network to improve tracking performance. We propose a novel architecture that combines the class and visual information present in tracking. To this end, we introduce a multimodal feature association network, allowing us to correlate the target class with its appearance during training and to aid the localization of the target during inference. Specifically, we first utilize an appearance model to extract the target's visual features, from which we obtain appearance cues such as shape and color. To employ target class information, we design a learned lightweight embedding network that embeds the target class into a feature representation. The association network of our architecture contains a multimodal fusion module and a predictor module. The fusion module combines features from the class and appearance streams, yielding multimodal features with more expressive representations for the subsequent module. The predictor module determines the target location in the current frame, through which we associate the class with the appearance. The class embedding module can thus learn appearance cues through backpropagation. To verify our method, we use the official training and test splits of LaSOT, with annotated images and classes, for our experiments. In particular, we analyze the imbalance in the samples and employ a class-validator discriminator to alleviate this problem. Extensive experimental results on LaSOT, UAV20L and UAV123@10fps demonstrate that our method achieves competitive results while maintaining real-time speed.
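As a hedged PyTorch sketch of the described association network, the snippet below embeds the class ID, fuses it with the visual features by concatenation, and predicts a localization score; the layer sizes and the concatenation-based fusion are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ClassVisualFusion(nn.Module):
    # Minimal reading: class embedding + visual features -> predictor.
    def __init__(self, num_classes, embed_dim=64, feat_dim=256):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, feat_dim), nn.ReLU())
        self.predictor = nn.Linear(feat_dim, 1)  # localization score

    def forward(self, visual_feat, class_id):
        e = self.class_embed(class_id)                 # class cue
        fused = self.fuse(torch.cat([visual_feat, e], dim=-1))
        return self.predictor(fused)

# Example: score a batch of one visual feature against class id 3.
score = ClassVisualFusion(num_classes=70)(torch.randn(1, 256), torch.tensor([3]))
```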