The Ultimate Guide to Video Object Detection

Victoria Mazo
9 min read · Oct 27, 2020

Object recognition in video is an important task for many applications, including autonomous driving perception, surveillance, wearable devices and IoT networks. Recognizing objects in video is more challenging than in still images because of motion blur, occlusions and unusual object poses. Since the appearance of an object may deteriorate in some frames, features or detections from other frames are commonly used to enhance the prediction. There are various approaches to the problem: dynamic programming, tracking methods, Recurrent Neural Networks, and feature aggregation with or without optical flow to propagate high-level features across frames. Some methods apply detection or feature aggregation only sparsely and therefore improve inference speed significantly. The most accurate results come from combining the two leading ingredients, multi-frame feature aggregation without optical flow and Seq-NMS post-processing, but the resulting pipelines are quite slow (under 10 FPS on a GPU). There is a trade-off between accuracy and speed: faster methods usually have lower accuracy, so there is still considerable room for research on methods that combine both.


Approaches to Object Detection in Videos

· Post-Processing

· Tracking-based Methods

· 3D Convolutions

· Recurrent Neural Networks

· Feature Propagation Methods

· Multi-frame Feature Aggregation with Optical Flow

· Multi-frame Feature Aggregation without Optical Flow

Post-Processing Methods

Post-processing methods are general procedures that can be applied to the output of any still-image object detector to improve its detections in video.

Sequence Non-Maximal Suppression (Seq-NMS)

Seq-NMS modifies detection confidences based on other detections along the same “track”, found via dynamic programming. It uses high-scoring object detections from nearby frames to boost the scores of weaker detections within the same video clip. Seq-NMS post-processing reduces spurious or flickering detections between frames and stabilizes the output; however, it slows down computation significantly. In addition, inference becomes offline, since the method requires future frames. Results: FGFA (R-FCN, ResNet101): 76.3 mAP at 1.4 FPS; FGFA (R-FCN, ResNet101) + Seq-NMS: 78.4 mAP at 1.1 FPS.
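
A minimal sketch of the core rescoring step, assuming per-frame detections are given as NumPy arrays of boxes ([x1, y1, x2, y2]) and scores; the 0.5 linking threshold, the mean-score rescoring and the single traceback pass are simplifications (the full algorithm repeatedly extracts the best sequence, rescores it and suppresses overlapping boxes):

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, boxes as [x1, y1, x2, y2]."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def seq_nms_rescore(boxes, scores, link_thresh=0.5):
    """boxes: list of (N_t, 4) arrays, scores: list of (N_t,) arrays, one per frame.
    Finds the highest-scoring linked sequence across frames (dynamic programming)
    and boosts the scores along it with the sequence's mean score."""
    T = len(boxes)
    best = [s.copy() for s in scores]                       # best cumulative score ending at each box
    prev = [np.full(len(s), -1, int) for s in scores]       # back-pointers to the previous frame
    for t in range(1, T):
        for i, box in enumerate(boxes[t]):
            if len(boxes[t - 1]) == 0:
                continue
            overlaps = iou(box, boxes[t - 1])
            linked = np.where(overlaps >= link_thresh)[0]   # boxes that can be linked to this one
            if len(linked) > 0:
                j = linked[np.argmax(best[t - 1][linked])]
                best[t][i] += best[t - 1][j]
                prev[t][i] = j
    # Trace back the single best sequence and rescore its boxes.
    t_end = int(np.argmax([b.max() if len(b) else -np.inf for b in best]))
    i = int(np.argmax(best[t_end]))
    seq, t = [], t_end
    while i != -1 and t >= 0:
        seq.append((t, i))
        i, t = prev[t][i], t - 1
    mean_score = np.mean([scores[t][i] for t, i in seq])
    for t, i in seq:
        scores[t][i] = max(scores[t][i], mean_score)
    return scores
```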

Seq-Bbox Matching

Since adjacent frames are similar and usually contain the same moving objects, detections in multiple adjacent frames are treated as repeated detections of the same objects and linked into tubelets. To join tubelets, the last bounding box of one tubelet is matched with the first bounding box of another. The boxes of the same tubelet are then rescored by averaging their classification scores. Tubelet-level bounding box linking helps to infer missed detections and improves detection recall. When applied sparsely to the video frames, the method improves an object detector's results significantly while also increasing speed. Results: YOLOv3: 68.6 mAP at 23 FPS; YOLOv3 + Seq-Bbox Matching: 78.2 mAP (online) / 80.9 mAP (offline) at 38 FPS.
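
The two distinctive ingredients, tubelet-level rescoring and recovery of missed detections, can be sketched roughly as follows; the array layout and the use of linear interpolation for the recovered boxes are illustrative assumptions:

```python
import numpy as np

def rescore_tubelet(class_scores):
    """Tubelet-level rescoring: average per-class scores over all linked boxes.
    class_scores: (N, C) array, one row per detection in the tubelet."""
    mean = class_scores.mean(axis=0)
    return np.tile(mean, (len(class_scores), 1))

def fill_missed_boxes(frame_ids, boxes):
    """Recover missed detections inside a tubelet by linearly interpolating boxes
    between the existing detections. frame_ids: sorted (N,) ints, boxes: (N, 4)."""
    full_ids = np.arange(frame_ids[0], frame_ids[-1] + 1)
    filled = np.stack([np.interp(full_ids, frame_ids, boxes[:, k]) for k in range(4)], axis=1)
    return full_ids, filled
```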

Robust and Efficient Post-Processing (REPP)

REPP links detections across frames by evaluating their similarity, then refines their classification and location to suppress false positives and recover missed detections. For every possible pair of detections from consecutive frames (t and t + 1), a set of features is built from their location, geometry, appearance and semantics. These features are used to predict a linking (similarity) score. Links are established between consecutive frames, and tubelets are started from the first pair of frames and extended as long as the corresponding objects keep being found in subsequent frames. REPP adds only a light computational overhead, but inference becomes offline. Results: YOLOv3: 68.6 mAP at 23 FPS; YOLOv3 + REPP: 75.1 mAP at 22 FPS.
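
A rough sketch of how such a pairwise linking score could be built; the exact feature set and the logistic-regression scorer below are illustrative assumptions rather than REPP's actual model:

```python
import numpy as np

def pairwise_link_features(det_a, det_b):
    """Features describing how likely two detections (frames t and t+1) belong to
    the same object: location/geometry differences, appearance similarity and
    semantic (class-score) similarity. det_* are dicts with 'box' [x1, y1, x2, y2],
    'embedding' (appearance vector) and 'probs' (class distribution)."""
    (xa1, ya1, xa2, ya2), (xb1, yb1, xb2, yb2) = det_a["box"], det_b["box"]
    center_dist = np.hypot(((xa1 + xa2) - (xb1 + xb2)) / 2, ((ya1 + ya2) - (yb1 + yb2)) / 2)
    size_ratio = np.log(((xb2 - xb1) * (yb2 - yb1) + 1e-9) / ((xa2 - xa1) * (ya2 - ya1) + 1e-9))
    app_sim = np.dot(det_a["embedding"], det_b["embedding"]) / (
        np.linalg.norm(det_a["embedding"]) * np.linalg.norm(det_b["embedding"]) + 1e-9)
    sem_sim = np.dot(det_a["probs"], det_b["probs"])
    return np.array([center_dist, size_ratio, app_sim, sem_sim])

def link_score(features, w, b):
    """A lightweight learned scorer (here a placeholder logistic regression) maps the
    features to a linking probability; pairs above a threshold are joined into tubelets."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))
```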

Tracking-based Methods

Integrated Object Detection and Tracking with Tracklet-Conditioned Detection

The Tracklet-Conditioned Detection network integrates detection and tracking at an early stage: instead of simply aggregating two sets of bounding boxes that are estimated separately by the detector and the tracker, a single set of boxes is generated jointly by the two processes, by conditioning the outputs of the object detector on the tracklets computed over the prior frames. The resulting detection boxes are therefore both consistent with the tracklets and have high detection responses, rather than often having just one or the other, as in late-integration techniques. The model achieves 83.5 mAP on ImageNet VID (with the R-FCN ResNet101 backbone) in the online setting.
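
A toy illustration of this kind of early integration, in which per-frame class probabilities are blended with a prior carried by the tracklets; the fusion rule, the temperature and the function names are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def tracklet_conditioned_scores(cls_logits, box_embed, tracklet_embeds, tracklet_probs, alpha=0.5):
    """cls_logits: (N, C) detector logits for N candidate boxes.
    box_embed: (N, D) appearance embeddings of the candidates.
    tracklet_embeds: (M, D) embeddings of tracklets from prior frames.
    tracklet_probs: (M, C) class distributions carried by those tracklets.
    Returns class probabilities that blend the per-frame prediction with a
    tracklet-conditioned prior, so boxes consistent with a tracklet keep its class."""
    det_probs = F.softmax(cls_logits, dim=-1)                                   # (N, C)
    sim = F.normalize(box_embed, dim=-1) @ F.normalize(tracklet_embeds, dim=-1).t()
    sim = F.softmax(sim / 0.1, dim=-1)                                          # (N, M) soft assignment
    prior = sim @ tracklet_probs                                                # (N, C)
    return (1 - alpha) * det_probs + alpha * prior
```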

3D Convolutions

Convolutional Neural Networks with 3D convolutions have mostly proven useful for volumetric data such as MRI scans. For object detection in videos, 3D convolutions do not bring much improvement over single-frame baselines.
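
For reference, a 3D convolution simply adds a temporal dimension to the kernel and slides it over a short clip; the shapes in this PyTorch snippet are arbitrary:

```python
import torch
import torch.nn as nn

# A 3D convolution slides a kernel over (time, height, width) of a clip tensor,
# mixing information across a few neighbouring frames at once.
clip = torch.randn(1, 3, 8, 224, 224)          # (batch, channels, frames, H, W)
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 7, 7),
                   stride=(1, 2, 2), padding=(1, 3, 3))
features = conv3d(clip)                        # -> (1, 64, 8, 112, 112)
```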

Recurrent Neural Networks

Mobile Video Object Detection with Temporally-Aware Feature Maps

The model combines a fast single-image object detector with convolutional LSTM layers to create an interleaved recurrent-convolutional architecture. An efficient Bottleneck-LSTM layer significantly reduces the computational cost compared to regular LSTMs. The model is online and runs in real time on low-powered mobile and embedded devices, achieving 45.1 mAP at 14.6 FPS on a mobile device.
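
A minimal sketch of a Bottleneck-LSTM-style cell in PyTorch: the concatenated input and hidden state are first squeezed into a narrow bottleneck, from which all gates are computed. Channel counts, activations and the use of plain (rather than depthwise-separable) convolutions are simplifications of the paper's layer:

```python
import torch
import torch.nn as nn

class BottleneckLSTMCell(nn.Module):
    """Convolutional LSTM cell that first projects [input, hidden] into a narrow
    bottleneck, then computes all gates from that cheaper representation."""
    def __init__(self, in_ch, hidden_ch, bottleneck_ch):
        super().__init__()
        self.bottleneck = nn.Conv2d(in_ch + hidden_ch, bottleneck_ch, 3, padding=1)
        self.gates = nn.Conv2d(bottleneck_ch, 4 * hidden_ch, 3, padding=1)
        self.hidden_ch = hidden_ch

    def forward(self, x, state=None):
        if state is None:                                   # start with an empty memory
            b, _, h, w = x.shape
            zeros = x.new_zeros(b, self.hidden_ch, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        z = torch.relu(self.bottleneck(torch.cat([x, h_prev], dim=1)))
        i, f, o, g = torch.chunk(self.gates(z), 4, dim=1)   # input, forget, output, candidate
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```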

Looking Fast and Slow: Memory-Guided Mobile Video Object Detection

The model contains two feature extractors with different speeds and recognition capacities, which run on different frames. The features from these extractors are used to maintain a common visual memory of the scene in the form of a convolutional LSTM, and detections are generated by fusing context from previous frames with the gist (a rich representation) from the current frame. The combination of memory and gist contains the information necessary to decide when the memory must be updated. The model is online and runs at 72.3 FPS on a mobile device, achieving 59.1 mAP.
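
A sketch of the interleaving policy, assuming placeholder modules `large_extractor`, `small_extractor`, `memory` (for example, a convolutional LSTM cell like the one above) and `detect_head`; the real model learns when to run the large extractor, whereas this sketch uses a fixed interval:

```python
def run_interleaved(frames, large_extractor, small_extractor, memory, detect_head, interval=10):
    """Run the expensive extractor every `interval` frames and the cheap one otherwise;
    both update a shared recurrent memory from which detections are decoded."""
    state, outputs = None, []
    for t, frame in enumerate(frames):
        extractor = large_extractor if t % interval == 0 else small_extractor
        gist = extractor(frame)                 # per-frame features ("gist")
        fused, state = memory(gist, state)      # fuse with the visual memory of the scene
        outputs.append(detect_head(fused))
    return outputs
```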

Feature Propagation Methods

Deep Feature Flow for Video Recognition (DFF)

Optical flow is currently the most explored way to exploit the temporal dimension in video object detection. DFF runs the expensive convolutional sub-network only on sparse key frames and propagates their deep feature maps to other frames via a flow field. The pipeline works in cycles of n frames: the first frame of each cycle is a key frame, on which the full object detector is run; for the following n-1 frames, features are obtained by warping the key-frame features with the estimated optical flow, and then the cycle repeats. DFF achieves a significant speedup, since flow computation is relatively cheap. The model achieves 73 mAP on ImageNet VID (with the R-FCN ResNet101 backbone) at 29 FPS in the online setting.
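
A sketch of the feature propagation step, using bilinear warping with `torch.nn.functional.grid_sample`; the flow network, the backbone and the detection head are omitted, and the flow convention (from the current frame to the key frame, in feature-map pixels) is an assumption:

```python
import torch
import torch.nn.functional as F

def warp_features(key_feat, flow):
    """key_feat: (B, C, H, W) features computed on the key frame.
    flow: (B, 2, H, W) flow from the current frame to the key frame, in feature-map pixels.
    Returns the key-frame features warped to the current frame."""
    b, _, h, w = key_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(key_feat.device)     # (2, H, W)
    coords = base.unsqueeze(0) + flow                                   # sampling positions
    # Normalise to [-1, 1] as required by grid_sample, with (x, y) ordering.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)                    # (B, H, W, 2)
    return F.grid_sample(key_feat, grid, align_corners=True)

# On a key frame:      feat = backbone(frame); detections = head(feat)
# On the other frames: feat = warp_features(key_feat, flow_net(frame, key_frame)); detections = head(feat)
```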

Multi-frame Feature Aggregation with Optical Flow

A way to improve accuracy in video detection is multi-frame feature aggregation. There are different ways of implementing it, but all of them revolve around one idea: detections are computed densely on every frame, features from neighboring frames are warped to the current frame, and the warped features are aggregated by weighted averaging. The current frame therefore benefits from the immediately adjacent frames, as well as from some further frames, to obtain a better detection. This helps with motion and with objects that are cropped or degraded in the current frame.

Flow-Guided Feature Aggregation for Video Object Detection (FGFA)

FGFA aggregates feature maps from nearby frames, which are well aligned to the current frame through the estimated optical flow. The architecture is an end-to-end framework that leverages temporal coherence at the feature level. FGFA achieves 76.3 mAP on ImageNet VID (with the R-FCN ResNet101 backbone) at 1.4 FPS in the online setting.
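
A sketch of the aggregation step, assuming the neighbouring frames' features have already been warped to the current frame (for example with a function like `warp_features` above); the adaptive weights come from the cosine similarity between embedded features, softmax-normalised over frames:

```python
import torch
import torch.nn.functional as F

def aggregate_warped_features(warped_feats, current_feat, embed):
    """warped_feats: list of (B, C, H, W) neighbour features already warped to the
    current frame; current_feat: (B, C, H, W); embed: a small embedding network.
    Each position becomes a weighted average of the warped features, with weights
    from the cosine similarity between embedded warped and current features."""
    cur_e = F.normalize(embed(current_feat), dim=1)
    sims = []
    for wf in warped_feats:
        wf_e = F.normalize(embed(wf), dim=1)
        sims.append((wf_e * cur_e).sum(dim=1, keepdim=True))   # (B, 1, H, W) per frame
    weights = torch.softmax(torch.stack(sims, dim=0), dim=0)   # normalise over frames
    return sum(w * wf for w, wf in zip(weights, warped_feats))
```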

Towards High Performance Video Object Detection (THP)

THP takes a unified approach based on the principle of multi-frame, end-to-end learning of features and cross-frame motion. It uses optical flow and sparsely recursive feature aggregation to retain the quality gains of aggregation while reducing the computational cost by operating only on sparse key frames. Spatially-adaptive partial feature updating recomputes features on non-key frames wherever the propagated features are of bad quality. The feature-quality estimate is learnt during end-to-end training, which further improves recognition accuracy. Temporally-adaptive key frame scheduling designates key frames according to the predicted feature quality, which makes key frame usage more efficient. THP achieves 77.8 mAP on ImageNet VID (with the R-FCN Deformable ResNet101 backbone) at 22.9 FPS in the online setting.
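
A sketch of how the key-frame scheduling and partial updating could be wired together; the module names, thresholds and control flow are illustrative assumptions (in particular, this sketch runs the full feature network on non-key frames for clarity, whereas the actual method is designed to avoid that):

```python
def process_frame(frame, state, backbone, flow_net, quality_net,
                  key_thresh=0.3, update_thresh=0.7):
    """state holds the last key frame and its features. A predicted per-position
    feature-quality map decides whether the frame becomes a new key frame and
    where the propagated features must be recomputed.
    warp_features: bilinear flow warping, as in the DFF sketch above."""
    flow = flow_net(frame, state["key_frame"])
    propagated = warp_features(state["key_feat"], flow)
    quality = quality_net(frame, propagated)                 # (B, 1, H, W), values in [0, 1]
    if quality.mean() < key_thresh:                          # too degraded: start a new key frame
        feat = backbone(frame)
        state = {"key_frame": frame, "key_feat": feat}
    else:                                                    # partial feature updating
        fresh = backbone(frame)
        mask = (quality < update_thresh).float()             # recompute only where quality is low
        feat = mask * fresh + (1 - mask) * propagated
    return feat, state
```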

Multi-frame Feature Aggregation without Optical Flow

Memory Enhanced Global-Local Aggregation for Video Object Detection (MEGA)

MEGA augments the candidate-box features of key frames by effectively aggregating global and local information. It reuses precomputed features obtained while detecting previous frames, which are enhanced by global information and cached in a Long Range Memory module. This is how recurrent connections between the current frame and previous frames are built up. MEGA achieves 82.9 mAP on ImageNet VID (with the R-FCN ResNet101 backbone) at 8.7 FPS. MEGA with Seq-NMS and the R-FCN ResNeXt101 backbone achieves 85.4 mAP.
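
A much-simplified sketch of the aggregation idea: the current frame's proposal features attend over local, global and memory features with a single relation (attention) step, whereas the real MEGA stacks several relation modules and manages the memory updates explicitly:

```python
import torch

def enhance_proposals(cur_feats, local_feats, global_feats, memory_feats):
    """cur_feats: (N, D) box features of the current key frame.
    local/global/memory_feats: (M_*, D) box features from adjacent frames, from
    shuffled distant frames, and from the cached Long Range Memory, respectively.
    Each current proposal attends over all supporting proposals and adds the
    attended value back (a single, simplified relation module)."""
    support = torch.cat([local_feats, global_feats, memory_feats], dim=0)  # (M, D)
    d = cur_feats.shape[1]
    attn = torch.softmax(cur_feats @ support.t() / d ** 0.5, dim=-1)       # (N, M)
    return cur_feats + attn @ support
```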

Mining Inter-Video Proposal Relations for Video Object Detection (HVRNet)

HVRNet boosts video object detection by leveraging both intra-video and inter-video contexts within a multi-level triplet selection scheme. A triplet consists of a target video, the most dissimilar video of the same category, and the most similar video of a different category, according to the cosine similarity of their CNN features. For each video in the triplet, its sampled frames are fed into the RPN and RoI layers of Faster R-CNN. This produces feature vectors of object proposals for each frame, which are aggregated to enhance the proposals in the target frame. Intra-video-enhanced proposals mainly capture object semantics within each individual video while ignoring object variations among videos. To model such variations, hard proposal triplets are selected from the video triplet according to the intra-video-enhanced features. For each proposal triplet, proposals from the support videos are aggregated to enhance the proposals in the target video, so each proposal feature also leverages inter-video dependencies to tackle object confusion among videos. HVRNet achieves 83.2 mAP on ImageNet VID (with the R-FCN ResNet101 backbone). HVRNet with Seq-NMS and the R-FCN ResNeXt101 backbone achieves a state-of-the-art result of 85.5 mAP.
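
A sketch of the video-level triplet selection, assuming each video is summarised by a single pooled CNN feature vector and a class label, and that at least one other same-class and one different-class video exist:

```python
import torch
import torch.nn.functional as F

def select_video_triplet(target_idx, video_feats, labels):
    """video_feats: (V, D) one pooled feature per video; labels: (V,) long tensor of class ids.
    Returns (hard positive, hard negative): the most DISsimilar video of the same
    class and the most similar video of a different class, by cosine similarity."""
    feats = F.normalize(video_feats, dim=1)
    sims = feats @ feats[target_idx]                          # (V,) cosine similarities
    same_class = labels == labels[target_idx]
    diff_class = ~same_class
    same_class[target_idx] = False                            # do not pair the target with itself
    hard_pos = int(torch.argmin(sims.masked_fill(~same_class, float("inf"))))
    hard_neg = int(torch.argmax(sims.masked_fill(~diff_class, float("-inf"))))
    return hard_pos, hard_neg
```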

Comparison table

[Table: mAP @0.5 on ImageNet VID]

* implementation within the MEGA project

** the method can be online with some minor changes

References

Seq-NMS (code Python)

Seq-Bbox Matching

REPP (code Python)

D&T (code PyTorch)

Tracklet-Conditioned Detection

Mobile Video Object Detection (code PyTorch)

Looking Fast and Slow (code PyTorch)

ST-lattice

FGFA (code MXNet, PyTorch*)

THP (code TensorFlow)

MANet (code MXNet)

SELSA (code MXNet)

OGEMN

RDN (code PyTorch*)

STMN (code Torch 7)

STSN

MEGA (code PyTorch)

HVRNet (code coming soon)
