cs.CV (2025-05-30)

📊 56 papers in total | 🔗 16 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗7) | Pillar 2: RL Algorithms & Architecture (14 🔗3) | Pillar 3: Spatial Perception & Semantics (8 🔗3) | Pillar 6: Video Extraction & Matching (6 🔗1) | Pillar 1: Robot Control (4) | Pillar 7: Motion Retargeting (2 🔗1) | Pillar 8: Physics-based Animation (2 🔗1) | Pillar 4: Generative Motion (2)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model	Proposes Period-LLM to address the shortcomings of multimodal large language models on periodic tasks	large language model multimodal	✅
2	Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts	Proposes Mixpert to resolve multimodal learning conflicts	large language model multimodal
3	DisTime: Distribution-based Time Representation for Video Large Language Models	Proposes DisTime to address time representation in video large language models	large language model TAMP	✅
4	Reasoning Can Hurt the Inductive Abilities of Large Language Models	Proposes structured interventions to improve the inductive reasoning of large language models	large language model chain-of-thought
5	Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks	Proposes Agent-X to address the evaluation of multi-step visual reasoning tasks	multimodal	✅
6	Geospatial Foundation Models to Enable Progress on Sustainable Development Goals	Proposes the SustainFM framework to advance progress on the Sustainable Development Goals	foundation model
7	Beyond Quantity: Distribution-Aware Labeling for Visual Grounding	Proposes the DAL framework to address label distribution in visual grounding	visual grounding
8	From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models	Proposes a unified framework to address hallucinations and jailbreak attacks in large foundation models	foundation model
9	Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT	Proposes MVPBench to address visual physical reasoning in multimodal large language models	large language model multimodal chain-of-thought
10	The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models	Proposes the principle of local perturbation with global impact to improve the security of pathology models	foundation model	✅
11	CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs	Proposes CSVQA to evaluate the scientific reasoning capabilities of vision-language models	multimodal	✅
12	Federated Foundation Model for GI Endoscopy Images	Proposes a federated foundation model to address data privacy for gastrointestinal endoscopy images	foundation model
13	SiLVR: A Simple Language-based Video Reasoning Framework	Proposes the SiLVR framework to address complex video-language understanding	large language model multimodal	✅
14	SORCE: Small Object Retrieval in Complex Environments	Proposes SORCE to address small object retrieval in complex environments	large language model multimodal
15	Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders	Proposes the Nar-KFC module to address keyframe selection in long video understanding	large language model multimodal
16	Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation	Proposes Geo-Sign to improve geometric representation in sign language translation	large language model	✅
17	ViStoryBench: Comprehensive Benchmark Suite for Story Visualization	Proposes ViStoryBench to address the lack of evaluation for story visualization	large language model
18	Conformal Prediction for Zero-Shot Models	Proposes Conf-OT to address uncertainty in zero-shot models	foundation model

🔬 Pillar 2: RL Algorithms & Architecture (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning	Proposes a multi-domain data mixture strategy to improve reinforcement learning for multimodal LLMs	reinforcement learning large language model multimodal
20	Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization	Proposes PANSv2 to address robustness and generalization in bronchoscopy localization	Mamba depth estimation foundation model
21	Reinforcing Video Reasoning with Focused Thinking	Proposes TW-GRPO to address ineffective reasoning chains and sparse rewards in video reasoning	reinforcement learning spatiotemporal large language model	✅
22	VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video	Proposes VideoCAD to address learning complex 3D CAD UI interactions	behavior cloning large language model multimodal
23	ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation	Proposes ACM-UNet to address structural mismatch in medical image segmentation	Mamba SSM state space model	✅
24	LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework	Proposes LTM3D to address dependency modeling in conditional 3D generation	masked autoencoder 3D gaussian splatting gaussian splatting
25	A Mathematical Perspective On Contrastive Learning	Proposes a mathematical framework for contrastive learning to address multimodal data alignment	contrastive learning multimodal
26	Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation	Proposes CroDiNo-KD to address knowledge distillation in RGBD semantic segmentation	contrastive learning distillation
27	Progressive Class-level Distillation	Proposes progressive class-level distillation to address insufficient information from low-probability classes in knowledge distillation	teacher-student distillation
28	A Cross Branch Fusion-Based Contrastive Learning Framework for Point Cloud Self-supervised Learning	Proposes the PoCCA framework to improve point cloud self-supervised learning	contrastive learning
29	EgoVIS@CVPR: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning	Proposes state-change counterfactuals to improve procedure-aware video representation learning	representation learning
30	Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation	Proposes Reason-SVG to address insufficient reasoning in SVG generation	reinforcement learning large language model
31	STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence	Proposes STORK to address sampling efficiency in diffusion and flow matching models	flow matching	✅
32	State Estimation and Control of Dynamic Systems from High-Dimensional Image Data	Proposes a novel neural architecture to address state estimation in dynamic systems	reinforcement learning policy learning

🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	Tackling View-Dependent Semantics in 3D Language Gaussian Splatting	Proposes LaGa to address view-dependent semantics in 3D scenes	3D gaussian splatting gaussian splatting splatting	✅
34	InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing	Proposes InteractAnything to address zero-shot human-object interaction synthesis	affordance human-object interaction HOI
35	Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors	Proposes a weakly supervised affordance grounding method to address label scarcity	affordance human-object interaction egocentric	✅
36	un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP	Proposes un$^2$CLIP to improve CLIP's ability to capture visual detail	open-vocabulary open vocabulary large language model	✅
37	3D Gaussian Splat Vulnerabilities	Proposes CLOAK and DAGGER to expose security vulnerabilities in 3D Gaussian splatting	3D gaussian splatting 3DGS gaussian splatting
38	Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors	Proposes VG LLM to address understanding 3D scenes directly from video	scene understanding large language model multimodal
39	6D Pose Estimation on Point Cloud Data through Prior Knowledge Integration: A Case Study in Autonomous Disassembly	Proposes a prior-knowledge-based 6D pose estimation method to address autonomous disassembly	6D pose estimation
40	AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion	Proposes AdaHuman to address high-quality 3D human avatar generation	3DGS

🔬 Pillar 6: Video Extraction & Matching (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
41	Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames	Proposes the Disjoint-3DQA benchmark to address long-horizon spatial reasoning	egocentric embodied AI
42	Leadership Assessment in Pediatric Intensive Care Unit Team Training	Proposes an automated analysis framework to assess leadership in PICU teams	egocentric egocentric vision multimodal
43	Learning reusable concepts across different egocentric video understanding tasks	Proposes the Hier-EgoPack framework to address concept reuse across video understanding tasks	egocentric
44	PCIE_Interaction Solution for Ego4D Social Interaction Challenge	Proposes the PCIE_Interaction solution for the Ego4D Social Interaction Challenge	Ego4D	✅
45	Reading Recognition in the Wild	Proposes the reading recognition task to address recording user interactions on smart glasses	egocentric multimodal
46	PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge	Proposes HP-ViT+ to address hand pose estimation in RGB video	egocentric multimodal

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
47	Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces	Proposes the VeBrain framework to address integrating multimodal large language models into robot control	legged robot large language model multimodal
48	Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction	Proposes a flow-based video prediction method to address generalization of bimanual manipulation policies	manipulation bi-manual dual-arm
49	S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation	Proposes S4-Driver to address inadequate input representation in self-supervised driving planning	motion planning large language model multimodal
50	Benchmarking Foundation Models for Zero-Shot Biometric Tasks	Proposes a benchmark evaluation of foundation models on zero-shot biometric tasks	manipulation large language model foundation model

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
51	MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM	Proposes the MIRAGE benchmark to assess hallucination in multimodal large language models	spatial relationship large language model multimodal
52	Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation	Proposes the LongBench-T2I benchmark to address complex instruction-based image generation	spatial relationship large language model	✅

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
53	S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification	Proposes S3CE-Net to address long-sequence event re-identification	spatiotemporal	✅
54	Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification	Proposes a deep learning-based framework to classify forest machine operations	spatiotemporal

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
55	Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes	Proposes Ctrl-Crash to address realistic car crash simulation	classifier-free guidance
56	MiniMax-Remover: Taming Bad Noise Helps Video Object Removal	Proposes MiniMax-Remover to address noise in video object removal	classifier-free guidance
