cs.CV (2025-08-06)

📊 52 papers in total | 🔗 18 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗9) · Pillar 2: RL & Architecture (12 🔗4) · Pillar 3: Perception & Semantics (10 🔗3) · Pillar 6: Video Extraction (3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 1: Robot Control (1 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (23 papers)

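| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 1 | From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models | Proposes MLLMU-Med to address security issues in biomedical multimodal large language models | large language model, multimodal |
| 2 | Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models | Proposes O-Bench to address occlusion perception in multimodal large language models | large language model, multimodal |
| 3 | UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval | Proposes UniFGVC to address few-shot fine-grained visual classification | large language model, multimodal, chain-of-thought |
| 4 | Revealing Temporal Label Noise in Multimodal Hateful Video Classification | Proposes a fine-grained label noise analysis to improve the accuracy of multimodal hateful video classification | multimodal, TAMP |
| 5 | FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging | Proposes FinMMR to improve multimodal capability in financial numerical reasoning | large language model, multimodal |
| 6 | AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization | Proposes the AD-FM framework to address adaptability problems in multimodal anomaly detection | large language model, multimodal |
| 7 | Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability | Proposes an evaluation framework for input scrutiny ability to address faulty-input recognition in multimodal models | large language model, multimodal |
| 8 | TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration | Proposes TotalRegistrator to address multi-organ registration of CT images | foundation model |
| 9 | Benchmarking Foundation Models for Mitotic Figure Classification | Proposes a self-supervised learning approach to improve mitotic figure classification performance | foundation model |
| 10 | VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones | Proposes VisionTS++ to address cross-modal transfer of vision models to time series forecasting | foundation model |
| 11 | Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction | Proposes an intention-enhanced diffusion model to address multimodal pedestrian trajectory prediction | multimodal |
| 12 | Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification | Proposes MMCAF-Net to address misdiagnosis of small lesions | multimodal |
| 13 | SVC 2025: the First Multimodal Deception Detection Challenge | Proposes the SVC 2025 challenge to address cross-domain generalization in multimodal deception detection | multimodal |
| 14 | X-SAM: From Segment Anything to Any Segmentation | Proposes X-SAM to address the limitations of existing image segmentation models | large language model, multimodal |
| 15 | CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization | Proposes a semantic propagation method based on cross-modal salient anchors to address weakly supervised dense audio-visual event localization | multimodal, TAMP |
| 16 | Static and Plugged: Make Embodied Evaluation Simple | Proposes StaticEmbodiedBench to address the limitations of existing evaluation methods | vision-language-action, VLA |
| 17 | Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder | Proposes MLLMSeg to address performance and cost issues in referring expression segmentation | large language model, multimodal |
| 18 | EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts | Proposes the EncQA benchmark to improve visual reasoning for chart understanding | multimodal |
| 19 | Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan | Proposes the FAME challenge to address face-voice association in multilingual environments | multimodal |
| 20 | Analyzing and Mitigating Object Hallucination: A Training Bias Perspective | Proposes Obliviate to address object hallucination in large vision-language models | multimodal |
| 21 | Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation | Proposes TGS-Agent to address object understanding in audio-visual segmentation | multimodal |
| 22 | Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting | Proposes continual learning methods for vision-language models to address forgetting | multimodal |
| 23 | ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations | Proposes ToxicTAGS to address the annotation and detection of harmful meme content | multimodal |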

🔬 Pillar 2: RL & Architecture (12 papers)

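| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 24 | Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | Proposes the VITAL framework to address insufficient multimodal interaction in long video reasoning | reinforcement learning, large language model, multimodal |
| 25 | On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications | Proposes multimodal privileged knowledge distillation to improve the diagnostic capability of vision models | distillation, multimodal |
| 26 | A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks | Proposes the MAEPD model to address imbalanced data distribution in DAS signal recognition | masked autoencoder, spatiotemporal, foundation model |
| 27 | CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework | Proposes the CoMAD framework to address resource constraints of self-supervised learning models | MAE, contrastive learning, distillation |
| 28 | Occupancy Learning with Spatiotemporal Memory | Proposes ST-Occ to address spatiotemporal consistency in 3D occupancy learning | representation learning, spatiotemporal |
| 29 | TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding | Proposes TSPO to address the sampling problem in long-form video language understanding | reinforcement learning, large language model, multimodal |
| 30 | BEVCon: Advancing Bird's Eye View Perception with Contrastive Learning | Proposes BEVCon to improve bird's-eye-view perception in autonomous driving | representation learning, contrastive learning |
| 31 | Unmasking Interstitial Lung Diseases: Leveraging Masked Autoencoders for Diagnosis | Uses masked autoencoders to improve the diagnosis of interstitial lung diseases | masked autoencoder, MAE |
| 32 | TopKD: Top-scaled Knowledge Distillation | Proposes TopKD to make better use of logit information in knowledge distillation | distillation |
| 33 | Learning Using Privileged Information for Litter Detection | Proposes a deep learning method that incorporates privileged information to improve litter detection accuracy | privileged information |
| 34 | S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation | Proposes S$^2$Q-VDiT to address quantization and learning challenges in video diffusion models | distillation |
| 35 | ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs | Proposes the ViFP framework to address faulty reasoning in vision-language models | reinforcement learning, distillation |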

🔬 Pillar 3: Perception & Semantics (10 papers)

| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 36 | DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting | Proposes DET-GS to address insufficient 3D reconstruction accuracy under sparse views | depth estimation, metric depth, 3D gaussian splatting |
| 37 | MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction | Proposes MuGS to address multi-baseline view synthesis | depth estimation, monocular depth, gaussian splatting |
| 38 | CryoSplat: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction | Proposes CryoSplat to address the initialization problem in cryo-EM reconstruction | gaussian splatting, splatting |
| 39 | What Holds Back Open-Vocabulary Segmentation? | Proposes new components to address the bottlenecks of open-vocabulary segmentation | open-vocabulary, open vocabulary |
| 40 | BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment | Proposes BridgeDepth to address the fusion of monocular and stereo depth estimation | depth estimation, monocular depth, stereo depth |
| 41 | Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens | Proposes a method for extending monocular depth estimation to fisheye cameras | monocular depth |
| 42 | Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline | Proposes an RGB SLAM method based on 3D Gaussian mapping to address the depth estimation problem | visual SLAM, optical flow, SplaTAM |
| 43 | SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition | Proposes SplitGaussian to address motion leakage in dynamic scene reconstruction | gaussian splatting, splatting, scene reconstruction |
| 44 | IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control | Proposes IDC-Net to address geometric consistency in RGB-D video generation | scene reconstruction, geometric consistency |
| 45 | PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction | Proposes PIS3R to address large-parallax image stitching | scene reconstruction |

🔬 Pillar 6: Video Extraction (3 papers)

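| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 46 | Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions | Proposes the InterVLA dataset to address human-machine interaction understanding | egocentric, egocentric vision, vision-language-action |
| 47 | One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion | Proposes the OMFA framework to address flexibility in virtual try-on and try-off | SMPL, SMPL-X |
| 48 | DOMR: Establishing Cross-View Segmentation via Dense Object Matching | Proposes the DOMR framework to address cross-view object matching | egocentric |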

🔬 Pillar 8: Physics-based Animation (2 papers)

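| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 49 | DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling | Proposes DDTracking to address tract reconstruction in diffusion MRI | spatiotemporal |
| 50 | TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction | Proposes TurboTrain to address efficient training for multi-agent perception and prediction | spatiotemporal |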

🔬 Pillar 1: Robot Control (1 paper)

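| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 51 | VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning | Proposes VisualTrans to address visual transformation reasoning in real-world scenes | manipulation, sim-to-real, human-object interaction |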

🔬 Pillar 4: Generative Motion (1 paper)

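| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 52 | Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation | Proposes a new framework to address data requirements and generalization in sign language video generation | motion synthesis, multimodal |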
