MRTP: Multiscale video anomaly detection with Representative Text Prompt
Journal: Image and Vision Computing (IVC)
Keywords: Video anomaly detection; Multimodal; Contrastive learning; Multiscale; Representative
Abstract: Weakly supervised video anomaly detection (VAD) aims to detect abnormal events based on video-level annotations. Pre-trained multimodal models have recently gained popularity in VAD due to their rich semantic information. However, existing studies often struggle to generate representative textual features, which limits their ability to characterize video anomalies. Additionally, most VAD methods rely on self-attention mechanisms to capture global context dependencies between video segments; these mechanisms inevitably introduce long-range noise. Furthermore, the temporal alignment of textual and visual features remains a significant challenge. To address these issues, we propose a Multiscale VAD method with Representative Text Prompt (MRTP). Specifically, we introduce a Generalization Feature Sampling (GFS) module, which leverages a Conditional Variational Autoencoder (CVAE) to generate representative textual features. Meanwhile, we develop a Multiscale Dynamic Feature Fusion (MDFF) module that reduces the impact of long-range noise by fusing global context features with local abnormal features. Furthermore, we introduce a Visual-Textual Contrastive Learning (VTCL) module that selects representative visual features while accounting for feature-norm noise and aligns them with textual features through contrastive learning. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of MRTP. It achieves an Area Under the Curve (AUC) of 98.27% on ShanghaiTech, 87.91% on UCF-Crime, and an Average Precision (AP) of 84.51% on XD-Violence.
Type: Journal article
Volume: 169, 10595
Translation: No
Publication date: 2026-03-31
Indexed by: SCI
Journal link: https://www.sciencedirect.com/science/article/abs/pii/S0262885626000661
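The abstract's VTCL module aligns visual and textual features through contrastive learning. As a rough illustration of that general idea (not the paper's actual implementation — the function name, shapes, and the symmetric InfoNCE formulation here are all assumptions), a minimal sketch might look like:

```python
# Minimal sketch of visual-textual contrastive alignment (symmetric
# InfoNCE-style loss). Illustrative only; not the MRTP implementation.
import numpy as np

def contrastive_loss(visual, textual, temperature=0.07):
    """Contrastive loss between L2-normalized feature rows.

    visual, textual: (N, D) arrays; row i of each is a matched pair.
    Returns the mean cross-entropy over both matching directions.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    logits = v @ t.T / temperature        # (N, N) cosine-similarity matrix
    idx = np.arange(len(v))               # matched pairs lie on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # visual->text and text->visual directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing such a loss pulls each matched visual-textual pair together and pushes mismatched pairs apart, which is the standard mechanism behind the alignment the abstract describes.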

