UBC Theses and Dissertations

Intelligent surveillance with multimodal object detection in complex environments

Cao, Yue

Abstract

Surveillance systems play a crucial role in ensuring public safety. With the advent of deep learning algorithms, these systems have evolved from passive monitoring tools that relied heavily on human operators to advanced solutions capable of autonomously analyzing scenes with minimal human input. However, accurately detecting objects of interest in real-world scenarios remains a significant challenge because of dynamic illumination and varying object sizes. This research aims to enhance the accuracy and robustness of intelligent surveillance systems for object detection in complex environments by integrating two complementary sensor modalities: visible light (RGB) and infrared (IR) images. First, a multimodal detection framework is developed on top of the Faster R-CNN architecture; it integrates features from both RGB and IR images for enhanced object detection. Poolfuser, a transformer-based fusion module, is then introduced and incorporated into the detection framework to fuse features from the different modalities from a spatial perspective, emphasizing the features that are critical for target detection. Experimental results show that the multimodal framework equipped with Poolfuser significantly outperforms unimodal detectors and other competing multimodal approaches in terms of detection accuracy in complex environments. Second, to further improve the detection accuracy of the multimodal detection framework without introducing additional computational load, a lightweight fusion module based on convolutional neural networks (CNNs) is introduced. This module, termed Channel Switching and Spatial Attention (CSSA), integrates input features along both the channel and spatial dimensions. The experimental results demonstrate that the CSSA module further improves detection accuracy without affecting the real-time performance of the detection framework. Finally, considering the impact of other components, such as the backbone network and the loss function, on detection performance, this study further optimizes the CSSA-based multimodal detection model and introduces CSSA-Det. CSSA-Det shows improved object detection performance over CSSA and other state-of-the-art multimodal frameworks, particularly in the accuracy of bounding box localization.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International