Paper Publications
Release time: 2026-03-31
- Journal: Image and Vision Computing (IVC)
- Key Words: Video anomaly detection; Multimodal; Contrastive learning; Multiscale; Representative
- Abstract: Weakly supervised video anomaly detection (VAD) aims to detect abnormal events from video-level annotations. Pre-trained multimodal models have recently gained popularity in VAD due to their rich semantic information. However, existing studies often struggle to generate representative textual features, which limits their ability to characterize video anomalies. Additionally, most VAD methods rely on self-attention to capture global context dependencies between video segments, which inevitably introduces long-range noise. Furthermore, temporally aligning textual and visual features remains a significant challenge. To address these issues, we propose a Multiscale VAD method with Representative Text Prompt (MRTP). Specifically, we introduce a Generalization Feature Sampling (GFS) module that leverages a Conditional Variational Autoencoder (CVAE) to generate representative textual features. Meanwhile, we develop a Multiscale Dynamic Feature Fusion (MDFF) module that reduces the impact of long-range noise by fusing global context features with local abnormal features. Furthermore, we introduce a Visual-Textual Contrastive Learning (VTCL) module that selects representative visual features, accounting for noise in feature norms, and aligns them with textual features through contrastive learning. Extensive experiments on benchmark datasets demonstrate the effectiveness of MRTP: it achieves an Area Under the Curve (AUC) of 98.27% on ShanghaiTech and 87.91% on UCF-Crime, and an Average Precision (AP) of 84.51% on XD-Violence. (An illustrative contrastive-alignment sketch follows this record.)
- Indexed by: Journal paper
- Volume: 169, 10595
- Translation or Not: No
- Date of Publication: 2026-03-31
- Included Journals: SCI
- Link to published journal: https://www.sciencedirect.com/science/article/abs/pii/S0262885626000661
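The visual-textual alignment step described in the abstract can be pictured with a generic InfoNCE-style contrastive loss. The sketch below is a minimal PyTorch illustration under that assumption only; the function name, tensor shapes, one-positive-per-pair scheme, and temperature of 0.07 are illustrative choices, not the paper's exact VTCL formulation (which additionally selects visual features based on feature norms).

```python
import torch
import torch.nn.functional as F

def visual_textual_contrastive_loss(visual, textual, temperature=0.07):
    """Generic InfoNCE-style alignment of paired visual/textual features.

    visual:  (N, D) selected visual segment features, one per sample
    textual: (N, D) matching textual prompt features
    Pair (i, i) is the positive; all other pairs serve as negatives.
    """
    v = F.normalize(visual, dim=-1)   # unit norm so dot product = cosine similarity
    t = F.normalize(textual, dim=-1)
    logits = v @ t.T / temperature    # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: match each visual feature to its text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Usage with random stand-in features:
v = torch.randn(8, 512)  # e.g., 8 selected segment features
t = torch.randn(8, 512)  # matching textual prompt features
loss = visual_textual_contrastive_loss(v, t)
```

Normalizing both modalities before the dot product keeps the loss sensitive to direction rather than magnitude, which is one common way such methods sidestep noisy feature norms.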

