Browsing by Subject "R-CNN"
Now showing 1 - 2 of 2
Item: A Multi-head Attention Approach with Complementary Multimodal Fusion for Vehicle Detection (2024-05)
Tabassum, Nujhat; El-Sharkawy, Mohamed; King, Brian; Rizkalla, Maher

The advancement of autonomous vehicle technologies has taken a significant leap with the development of an improved version of the Multimodal Vehicle Detection Network (MVDNet), distinguished by the integration of a multi-head attention layer. This key enhancement significantly refines the network's capability to process and integrate multimodal sensor data, an aspect that becomes crucial in the face of challenging weather conditions. The effectiveness of this upgraded Multi-Head MVDNet is rigorously verified through an extensive dataset acquired from the Oxford Radar Robotcar, demonstrating its enhanced performance capabilities. Notably, in complex environmental conditions, the Multi-Head MVDNet shows a marked superiority in terms of Average Precision (AP) compared to existing models, underscoring its advanced detection capabilities. The transition from the traditional MVDNet to the enhanced Multi-Head Vehicle Detection Network signifies a notable breakthrough in the arena of vehicle detection technologies, with a special emphasis on operation under severe meteorological conditions, such as the obscuring presence of dense fog or the complexities introduced by heavy snowfall. This significant enhancement capitalizes on the foundational principles of the original MVDNet, which skillfully amalgamates the individual strengths of lidar and radar sensors. This is achieved through an intricate and refined process of feature tensor fusion, creating a more robust and comprehensive sensory data interpretation framework. A major innovation introduced in this updated model is the implementation of a multi-head attention layer. This layer serves as a sophisticated replacement for the previously employed self-attention mechanism.
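As a rough illustration of what a multi-head attention layer does, the sketch below splits each feature vector into equal slices, runs scaled dot-product attention on each slice independently, and concatenates the results. It is not the MVDNet implementation: the learned query, key, value, and output projections of a real layer are omitted (treated as identity), and the function names, the 14-dimensional features, and the four-token input are assumptions made only for this example. The head count of seven follows the configuration reported in the abstract.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, num_heads):
    """Attend separately over num_heads slices of the feature dimension.

    Assumption: projections are identity, so each head simply sees a
    contiguous slice of the input features.
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        start, stop = h * d_head, (h + 1) * d_head
        part = [row[start:stop] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the per-head outputs back into full-width vectors.
    return [[value for head in heads for value in head[i]] for i in range(len(x))]


# Hypothetical input: 4 tokens (e.g. fused feature cells), 14 features each.
tokens = [[math.sin(0.1 * (i + 1) * (j + 1)) for j in range(14)] for i in range(4)]
out = multi_head_attention(tokens, num_heads=7)
```

With seven heads and 14 features, each head attends over a 2-dimensional slice, and the output has the same shape as the input.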
Segmenting the attention mechanism into several distinct partitions enhances the network's efficiency and accuracy in processing and interpreting vast arrays of sensor data. An exhaustive series of experimental analyses was undertaken to determine the optimal configuration of this multi-head attention mechanism. These experiments explored various combinations and settings, ultimately identifying a configuration consisting of seven distinct attention heads as the most effective. This setup was found to optimize the balance between computational efficiency and detection accuracy. When tested using the rich radar and lidar datasets from the ORR project, this advanced Multi-Head MVDNet configuration consistently demonstrated its superiority. It not only surpassed the performance of the original MVDNet but also showed marked improvements over models that relied solely on lidar data or the DEF models, especially in terms of vehicular detection accuracy. This enhancement in the MVDNet model, with its focus on multi-head attention, not only represents a significant leap in the field of autonomous vehicle detection but also lays a foundation for future research. It opens new pathways for exploring various attention mechanisms and their potential applicability in scenarios requiring real-time vehicle detection. Furthermore, it accentuates the importance of sophisticated sensor fusion techniques as vital tools in overcoming the challenges posed by adverse environmental conditions, thus paving the way for more resilient and reliable autonomous vehicular technologies.

Item: The clash between two worlds in human action recognition: supervised feature training vs Recurrent ConvNet (2016-11-28)
Raptis, Konstantinos; Tsechpenakis, Gavriil

Action recognition has been an active research topic for over three decades. There are various applications of action recognition, such as surveillance, human-computer interaction, and content-based retrieval.
Recent research focuses on datasets of movies, web videos, and TV shows. The nature of these datasets makes action recognition very challenging due to scene variability and complexity, namely background clutter, occlusions, viewpoint changes, fast irregular motion, and a large spatio-temporal search space (articulation configurations and motions). The use of local space and time image features shows promising results, avoiding the cumbersome and often inaccurate frame-by-frame segmentation (boundary estimation). We focus on two state-of-the-art methods for the action classification problem: dense trajectories and recurrent neural networks (RNN). Dense trajectories use typical supervised training (e.g., with Support Vector Machines) of features such as 3D-SIFT, extended SURF, HOG3D, and local trinary patterns; the main idea is to densely sample these features in each frame and track them through the sequence based on optical flow. On the other hand, the deep neural network uses the input frames to detect action and produce part proposals, i.e., estimate information on body parts (shapes and locations). We compare these two approaches, which are indicative of what is used today, both qualitatively and numerically, and describe our conclusions with respect to accuracy and efficiency.
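The dense-sampling-and-tracking idea mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the flow fields are synthetic rather than estimated from video frames, no feature descriptors are computed along the trajectories, and the names `dense_sample` and `track`, the 8x6 frame size, and the 2-pixel grid step are all hypothetical.

```python
def dense_sample(width, height, step):
    """Return grid points spaced `step` pixels apart, row by row."""
    return [(float(x), float(y))
            for y in range(0, height, step)
            for x in range(0, width, step)]


def track(points, flows, width, height):
    """Follow each sampled point through a sequence of flow fields.

    flows[t][y][x] is the (dx, dy) displacement at pixel (x, y) between
    frame t and frame t + 1. A trajectory stops growing once its point
    leaves the frame.
    """
    trajectories = [[p] for p in points]
    active = list(range(len(points)))
    for flow in flows:
        still_active = []
        for idx in active:
            x, y = trajectories[idx][-1]
            # Look up the displacement at the nearest pixel.
            dx, dy = flow[int(round(y))][int(round(x))]
            nx, ny = x + dx, y + dy
            if 0 <= nx <= width - 1 and 0 <= ny <= height - 1:
                trajectories[idx].append((nx, ny))
                still_active.append(idx)
        active = still_active
    return trajectories


# Synthetic example: every pixel moves one pixel to the right per frame.
WIDTH, HEIGHT, STEP = 8, 6, 2
uniform_flow = [[(1.0, 0.0) for _ in range(WIDTH)] for _ in range(HEIGHT)]
points = dense_sample(WIDTH, HEIGHT, STEP)
trajectories = track(points, [uniform_flow] * 3, WIDTH, HEIGHT)
```

In a real system the flow fields would come from an optical flow estimator, and descriptors computed along each trajectory would feed a supervised classifier such as an SVM.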