| # | Title | Summary | Keywords | ✅ |
|---|-------|---------|----------|----|
| 1 | MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues | Proposes MT-Video-Bench for evaluating the video understanding capability of multimodal LLMs in multi-turn dialogues. | large language model, multimodal | |
| 2 | $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs | VisiPruner: decodes discontinuous cross-modal dynamics in multimodal LLMs to enable efficient pruning. | large language model, multimodal | ✅ |
| 3 | Towards a Generalizable Fusion Architecture for Multimodal Object Detection | Proposes the FMCAF architecture to improve the generalization and robustness of multimodal object detection. | multimodal | |
| 4 | Glyph: Scaling Context Windows via Visual-Text Compression | Glyph: scales large language model context windows via visual-text compression. | large language model, multimodal | ✅ |
| 5 | Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention | Proposes Xihe, built on Hierarchical Interleaved Block Attention (HIBA), for scalable zero-shot time series learning. | foundation model, zero-shot transfer | |
| 6 | iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA | Proposes iDETEX, empowering multimodal LLMs for intelligent, detailed, and explainable image quality assessment. | large language model, multimodal | |
| 7 | SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference | SparseVILA: decouples visual sparsity for efficient VLM inference. | multimodal | |
| 8 | Elastic ViTs from Pretrained Models without Retraining | Proposes SnapViT, which obtains elastic compute from pretrained ViT models without retraining. | foundation model | |
| 9 | ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input | ImaGGen: zero-shot generation of co-speech semantic gestures grounded in language and image input. | multimodal | ✅ |
| 10 | Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization | Proposes a context-aware pseudo-label scoring framework for zero-shot video summarization that improves LLM performance on the task. | large language model | |
| 11 | Monitoring Horses in Stalls: From Object to Event Detection | Proposes a YOLOv11- and BoT-SORT-based monitoring system for horses in stalls, enabling automatic event detection. | foundation model | |
| 12 | Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | Proposes a recurrent attention-based token selection method for efficient streaming video-LLMs. | large language model | |
| 13 | Exploring The Missing Semantics In Event Modality | Proposes Semantic-E2VID, which leverages visual semantic knowledge to enhance event-to-video reconstruction. | foundation model | |