cs.CV (2025-08-06)

📊 52 papers in total | 🔗 18 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗9) | Pillar 2: RL & Architecture (12 🔗4) | Pillar 3: Perception & Semantics (10 🔗3) | Pillar 6: Video Extraction (3) | Pillar 8: Physics-based Animation (2 🔗1) | Pillar 1: Robot Control (1 🔗1) | Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (23 papers)

#	Title	One-line summary	Tags	🔗
1	From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models	Proposes MLLMU-Med to address security issues in biomedical multimodal large language models	large language model multimodal
2	Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models	Proposes O-Bench to address occlusion perception in multimodal large language models	large language model multimodal
3	UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval	Proposes UniFGVC for few-shot fine-grained visual classification	large language model multimodal chain-of-thought
4	Revealing Temporal Label Noise in Multimodal Hateful Video Classification	Proposes a fine-grained label-noise analysis to improve the accuracy of multimodal hateful video classification	multimodal TAMP	✅
5	FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging	Proposes FinMMR to strengthen multimodal financial numerical reasoning	large language model multimodal
6	AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization	Proposes the AD-FM framework to address adaptation issues in multimodal anomaly detection	large language model multimodal
7	Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability	Proposes an evaluation framework for input scrutiny ability to test whether multimodal models recognize faulty inputs	large language model multimodal	✅
8	TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration	Proposes TotalRegistrator for multi-organ CT image registration	foundation model	✅
9	Benchmarking Foundation Models for Mitotic Figure Classification	Proposes a self-supervised learning approach to improve mitotic figure classification performance	foundation model
10	VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones	Proposes VisionTS++ to address cross-modal transfer of vision models to time series forecasting	foundation model	✅
11	Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction	Proposes an intention-enhanced diffusion model for multimodal pedestrian trajectory prediction	multimodal
12	Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification	Proposes MMCAF-Net to address the misdiagnosis of small lesions	multimodal	✅
13	SVC 2025: the First Multimodal Deception Detection Challenge	Proposes the SVC 2025 challenge to address cross-domain generalization in multimodal deception detection	multimodal
14	X-SAM: From Segment Anything to Any Segmentation	Proposes X-SAM to address the limitations of existing image segmentation models	large language model multimodal	✅
15	CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization	Proposes semantic propagation based on cross-modal salient anchors for weakly supervised dense audio-visual event localization	multimodal TAMP
16	Static and Plugged: Make Embodied Evaluation Simple	Proposes StaticEmbodiedBench to address the limitations of existing evaluation methods	vision-language-action VLA
17	Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder	Proposes MLLMSeg to address the performance-cost trade-off in referring expression segmentation	large language model multimodal	✅
18	EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts	Proposes the EncQA benchmark to improve visual reasoning for chart understanding	multimodal
19	Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan	Proposes the FAME challenge on face-voice association in multilingual environments	multimodal
20	Analyzing and Mitigating Object Hallucination: A Training Bias Perspective	Proposes Obliviate to mitigate object hallucination in large vision-language models	multimodal
21	Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation	Proposes TGS-Agent to address object understanding in audio-visual segmentation	multimodal	✅
22	Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting	Surveys continual learning for vision-language models, with a taxonomy that goes beyond forgetting	multimodal	✅
23	ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations	Proposes ToxicTAGS for annotating and detecting toxic meme content	multimodal

🔬 Pillar 2: RL & Architecture (12 papers)

#	Title	One-line summary	Tags	🔗
24	Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning	Proposes the VITAL framework to address insufficient multimodal interaction in long video reasoning	reinforcement learning large language model multimodal	✅
25	On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications	Proposes multimodal privileged knowledge distillation to improve the diagnostic ability of vision models	distillation multimodal
26	A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks	Proposes the MAEPD model to address uneven data distribution in DAS signal recognition	masked autoencoder spatiotemporal foundation model
27	CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework	Proposes the CoMAD framework to address the resource constraints of self-supervised learning models	MAE contrastive learning distillation
28	Occupancy Learning with Spatiotemporal Memory	Proposes ST-Occ to address spatiotemporal consistency in 3D occupancy learning	representation learning spatiotemporal
29	TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding	Proposes TSPO to address temporal sampling in long-form video language understanding	reinforcement learning large language model multimodal	✅
30	BEVCon: Advancing Bird's Eye View Perception with Contrastive Learning	Proposes BEVCon to improve bird's-eye-view perception for autonomous driving	representation learning contrastive learning
31	Unmasking Interstitial Lung Diseases: Leveraging Masked Autoencoders for Diagnosis	Uses masked autoencoders to improve the diagnosis of interstitial lung diseases	masked autoencoder MAE	✅
32	TopKD: Top-scaled Knowledge Distillation	Proposes TopKD to make better use of logit information in knowledge distillation	distillation
33	Learning Using Privileged Information for Litter Detection	Proposes a deep learning method that incorporates privileged information to improve litter detection accuracy	privileged information
34	S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation	Proposes S$^2$Q-VDiT to address quantization and learning challenges in video diffusion models	distillation	✅
35	ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs	Proposes the ViFP framework to address erroneous reasoning in vision-language models	reinforcement learning distillation

🔬 Pillar 3: Perception & Semantics (10 papers)

#	Title	One-line summary	Tags	🔗
36	DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting	Proposes DET-GS to address insufficient 3D reconstruction accuracy under sparse views	depth estimation metric depth 3D gaussian splatting
37	MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction	Proposes MuGS for multi-baseline view synthesis	depth estimation monocular depth gaussian splatting	✅
38	CryoSplat: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction	Proposes CryoSplat to address initialization in cryo-EM reconstruction	gaussian splatting splatting
39	What Holds Back Open-Vocabulary Segmentation?	Proposes new components to address the bottlenecks of open-vocabulary segmentation	open-vocabulary open vocabulary
40	BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment	Proposes BridgeDepth to fuse monocular and stereo depth estimation	depth estimation monocular depth stereo depth	✅
41	Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens	Proposes a method for extending monocular depth estimation to fisheye cameras	monocular depth	✅
42	Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline	Proposes an RGB SLAM method based on 3D Gaussian mapping to address depth estimation	visual SLAM optical flow SplaTAM
43	SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition	Proposes SplitGaussian to address motion leakage in dynamic scene reconstruction	gaussian splatting splatting scene reconstruction
44	IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control	Proposes IDC-Net to address geometric consistency in RGB-D video generation	scene reconstruction geometric consistency
45	PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction	Proposes PIS3R for large-parallax image stitching	scene reconstruction

🔬 Pillar 6: Video Extraction (3 papers)

#	Title	One-line summary	Tags
46	Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions	Proposes the InterVLA dataset for understanding egocentric human-object-human interactions	egocentric egocentric vision vision-language-action
47	One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion	Proposes the OMFA framework to make virtual try-on and try-off more flexible	SMPL SMPL-X
48	DOMR: Establishing Cross-View Segmentation via Dense Object Matching	Proposes the DOMR framework for cross-view object matching	egocentric

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗
49	DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling	Proposes DDTracking for diffusion MRI tractography	spatiotemporal	✅
50	TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction	Proposes TurboTrain for efficient training of multi-agent perception and prediction	spatiotemporal

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗
51	VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning	Proposes VisualTrans for visual transformation reasoning in real-world scenes	manipulation sim-to-real human-object interaction	✅

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗
52	Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation	Proposes a new framework to address data requirements and generalization in sign language video generation	motion synthesis multimodal
