cs.CV (2025-05-30)

📊 56 papers in total | 🔗 16 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗7) · Pillar 2: RL & Architecture (14 🔗3) · Pillar 3: Perception & Semantics (8 🔗3) · Pillar 6: Video Extraction (6 🔗1) · Pillar 1: Robot Control (4) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (2)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

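| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model | Proposes Period-LLM to address the shortcomings of multimodal large language models on periodic tasks | large language model, multimodal | |
| 2 | Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts | Proposes Mixpert to mitigate multimodal learning conflicts | large language model, multimodal | |
| 3 | DisTime: Distribution-based Time Representation for Video Large Language Models | Proposes DisTime to address time representation in video large language models | large language model, TAMP | |
| 4 | Reasoning Can Hurt the Inductive Abilities of Large Language Models | Proposes structured interventions to improve the inductive reasoning of large language models | large language model, chain-of-thought | |
| 5 | Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | Proposes Agent-X to address the evaluation of multi-step visual reasoning tasks | multimodal | |
| 6 | Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | Proposes the SustainFM framework to advance progress on the Sustainable Development Goals | foundation model | |
| 7 | Beyond Quantity: Distribution-Aware Labeling for Visual Grounding | Proposes the DAL framework to address label distribution in visual grounding | visual grounding | |
| 8 | From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models | Proposes a unified framework to address hallucination and jailbreak attacks in large foundation models | foundation model | |
| 9 | Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | Proposes MVPBench to address visual physical reasoning in multimodal large language models | large language model, multimodal, chain-of-thought | |
| 10 | The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models | Proposes the principle of local perturbation with global impact to improve the security of pathology models | foundation model | |
| 11 | CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs | Proposes CSVQA to evaluate the scientific reasoning of vision-language models | multimodal | |
| 12 | Federated Foundation Model for GI Endoscopy Images | Proposes a federated foundation model to address data privacy for gastrointestinal endoscopy images | foundation model | |
| 13 | SiLVR: A Simple Language-based Video Reasoning Framework | Proposes the SiLVR framework to address complex video-language understanding | large language model, multimodal | |
| 14 | SORCE: Small Object Retrieval in Complex Environments | Proposes SORCE to address small object retrieval in complex environments | large language model, multimodal | |
| 15 | Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders | Proposes the Nar-KFC module to address keyframe selection in long video understanding | large language model, multimodal | |
| 16 | Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation | Proposes Geo-Sign to improve geometric representation in sign language translation | large language model | |
| 17 | ViStoryBench: Comprehensive Benchmark Suite for Story Visualization | Proposes ViStoryBench to address inadequate evaluation of story visualization | large language model | |
| 18 | Conformal Prediction for Zero-Shot Models | Proposes Conf-OT to address uncertainty in zero-shot models | foundation model | |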

🔬 Pillar 2: RL & Architecture (14 papers)

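| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 19 | MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning | Proposes a multi-domain data mixture strategy to improve reinforcement learning for multimodal LLMs | reinforcement learning, large language model, multimodal | |
| 20 | Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization | Proposes PANSv2 to address robustness and generalization in bronchoscopy localization | Mamba, depth estimation, foundation model | |
| 21 | Reinforcing Video Reasoning with Focused Thinking | Proposes TW-GRPO to address ineffective reasoning chains and sparse rewards in video reasoning | reinforcement learning, spatiotemporal, large language model | |
| 22 | VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video | Proposes VideoCAD to address learning of complex 3D CAD UI interactions | behavior cloning, large language model, multimodal | |
| 23 | ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation | Proposes ACM-UNet to address structural mismatch in medical image segmentation | Mamba, SSM, state space model | |
| 24 | LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework | Proposes LTM3D to address dependency modeling in conditional 3D generation | masked autoencoder, 3D gaussian splatting, gaussian splatting | |
| 25 | A Mathematical Perspective On Contrastive Learning | Proposes a mathematical framework for contrastive learning to address multimodal data alignment | contrastive learning, multimodal | |
| 26 | Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation | Proposes CroDiNo-KD to address knowledge distillation for RGBD semantic segmentation | contrastive learning, distillation | |
| 27 | Progressive Class-level Distillation | Proposes progressive class-level distillation to address insufficient information from low-probability classes in knowledge distillation | teacher-student, distillation | |
| 28 | A Cross Branch Fusion-Based Contrastive Learning Framework for Point Cloud Self-supervised Learning | Proposes the PoCCA framework to improve self-supervised learning on point clouds | contrastive learning | |
| 29 | EgoVIS@CVPR: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning | Proposes state-change counterfactuals to improve procedure-aware video representation learning | representation learning | |
| 30 | Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation | Proposes Reason-SVG to address insufficient reasoning in SVG generation | reinforcement learning, large language model | |
| 31 | STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence | Proposes STORK to address sampling efficiency in diffusion and flow-matching models | flow matching | |
| 32 | State Estimation and Control of Dynamic Systems from High-Dimensional Image Data | Proposes a novel neural architecture to address state estimation for dynamic systems | reinforcement learning, policy learning | |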

🔬 Pillar 3: Perception & Semantics (8 papers)

| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 33 | Tackling View-Dependent Semantics in 3D Language Gaussian Splatting | Proposes LaGa to address view-dependent semantics in 3D scenes | 3D gaussian splatting, gaussian splatting, splatting | |
| 34 | InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing | Proposes InteractAnything to address zero-shot human-object interaction synthesis | affordance, human-object interaction, HOI | |
| 35 | Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors | Proposes a weakly supervised affordance grounding method to address label scarcity | affordance, human-object interaction, egocentric | |
| 36 | un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP | Proposes un$^2$CLIP to improve CLIP's ability to capture visual detail | open-vocabulary, open vocabulary, large language model | |
| 37 | 3D Gaussian Splat Vulnerabilities | Proposes CLOAK and DAGGER to expose security vulnerabilities in 3D Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 38 | Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors | Proposes VG LLM to address understanding 3D scenes directly from video | scene understanding, large language model, multimodal | |
| 39 | 6D Pose Estimation on Point Cloud Data through Prior Knowledge Integration: A Case Study in Autonomous Disassembly | Proposes a prior-knowledge-based 6D pose estimation method to address autonomous disassembly | 6D pose estimation | |
| 40 | AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion | Proposes AdaHuman to address high-quality 3D human avatar generation | 3DGS | |

🔬 Pillar 6: Video Extraction (6 papers)

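| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 41 | Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames | Proposes the Disjoint-3DQA benchmark to address long-horizon spatial reasoning | egocentric, embodied AI | |
| 42 | Leadership Assessment in Pediatric Intensive Care Unit Team Training | Proposes an automated analysis framework to assess leadership in PICU teams | egocentric, egocentric vision, multimodal | |
| 43 | Learning reusable concepts across different egocentric video understanding tasks | Proposes the Hier-EgoPack framework to address concept reuse across video understanding tasks | egocentric | |
| 44 | PCIE_Interaction Solution for Ego4D Social Interaction Challenge | Proposes the PCIE_Interaction solution for the Ego4D Social Interaction Challenge | Ego4D | |
| 45 | Reading Recognition in the Wild | Proposes a reading recognition task to address the logging of user interactions on smart glasses | egocentric, multimodal | |
| 46 | PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge | Proposes HP-ViT+ to address hand pose estimation in RGB video | egocentric, multimodal | |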

🔬 Pillar 1: Robot Control (4 papers)

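| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 47 | Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces | Proposes the VeBrain framework to address the integration of multimodal large language models into robot control | legged robot, large language model, multimodal | |
| 48 | Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction | Proposes a flow-based video prediction method to address the generalization of bimanual manipulation policies | manipulation, bi-manual, dual-arm | |
| 49 | S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | Proposes S4-Driver to address inadequate input representation in self-supervised driving planning | motion planning, large language model, multimodal | |
| 50 | Benchmarking Foundation Models for Zero-Shot Biometric Tasks | Proposes a benchmark evaluation of foundation models on zero-shot biometric tasks | manipulation, large language model, foundation model | |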

🔬 Pillar 7: Motion Retargeting (2 papers)

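| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 51 | MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM | Proposes the MIRAGE benchmark to assess hallucination in multimodal large language models | spatial relationship, large language model, multimodal | |
| 52 | Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Proposes the LongBench-T2I benchmark to address complex instruction-based image generation | spatial relationship, large language model | |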

🔬 Pillar 8: Physics-based Animation (2 papers)

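| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 53 | S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification | Proposes S3CE-Net to address long-sequence event re-identification | spatiotemporal | |
| 54 | Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification | Proposes a deep learning-based framework to classify forest machine operations | spatiotemporal | |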

🔬 Pillar 4: Generative Motion (2 papers)

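| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 55 | Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes | Proposes Ctrl-Crash to address realistic car crash simulation | classifier-free guidance | |
| 56 | MiniMax-Remover: Taming Bad Noise Helps Video Object Removal | Proposes MiniMax-Remover to address noise in video object removal | classifier-free guidance | |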
