MRTP: Multiscale video anomaly detection with Representative Text Prompt
Journal: Image and Vision Computing (IVC)
Keywords: Video anomaly detection; Multimodal; Contrastive learning; Multiscale; Representative
Abstract: Weakly supervised video anomaly detection (VAD) aims to detect abnormal events based on video-level annotations. Pre-trained multimodal models have recently gained popularity in VAD due to their rich semantic information. However, existing studies often struggle to generate representative textual features, which limits their ability to characterize video anomalies. Additionally, most VAD methods rely on self-attention mechanisms to capture global context dependencies between video segments; these mechanisms inevitably introduce long-range noise. Furthermore, the temporal alignment of textual and visual features remains a significant challenge. To address these issues, we propose a Multiscale VAD method with Representative Text Prompt (MRTP). Specifically, we introduce a Generalization Feature Sampling (GFS) module, which leverages a Conditional Variational Autoencoder (CVAE) to generate representative textual features. Meanwhile, we develop a Multiscale Dynamic Feature Fusion (MDFF) module that reduces the impact of long-range noise by fusing global context features with local abnormal features. Furthermore, we introduce a Visual-Textual Contrastive Learning (VTCL) module that selects representative visual features while accounting for feature-norm noise and aligns them with textual features through contrastive learning. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of MRTP. It achieves an Area Under the Curve (AUC) of 98.27% on ShanghaiTech, 87.91% on UCF-Crime, and an Average Precision (AP) of 84.51% on XD-Violence.
Type: Journal article
Volume: 169, 10595
Translation: No
Publication date: 2026-03-31
Indexed by: SCI
Journal link: https://www.sciencedirect.com/science/article/abs/pii/S0262885626000661
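The abstract's VTCL module aligns visual and textual features through contrastive learning. As a rough illustration of that general idea (not the paper's actual implementation — the function name, shapes, and the symmetric InfoNCE formulation here are all assumptions), a minimal sketch might look like:

```python
# Minimal sketch of visual-textual contrastive alignment (symmetric
# InfoNCE-style loss). Illustrative only; not the MRTP implementation.
import numpy as np

def contrastive_loss(visual, textual, temperature=0.07):
    """Contrastive loss between L2-normalized feature rows.

    visual, textual: (N, D) arrays; row i of each is a matched pair.
    Returns the mean cross-entropy over both matching directions.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    logits = v @ t.T / temperature        # (N, N) cosine-similarity matrix
    idx = np.arange(len(v))               # matched pairs lie on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # visual->text and text->visual directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing such a loss pulls each matched visual-textual pair together and pushes mismatched pairs apart, which is the standard mechanism behind the alignment the abstract describes.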

