cs.CV (2025-10-02)

📊 33 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) | Pillar 2: RL & Architecture (10 🔗3) | Pillar 3: Perception & Semantics (6 🔗2) | Pillar 6: Video Extraction (2) | Pillar 1: Robot Control (1) | Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations	Uses visual questions from blind and low-vision people to guide multimodal large language models toward proactive visual interpretations	large language model multimodal	✅
2	Inferring Dynamic Physical Properties from Video Foundation Models	Uses video foundation models to infer dynamic physical properties from video	large language model foundation model
3	ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models	Proposes ImageNet-Think-250K to improve the multimodal reasoning of vision-language models	multimodal
4	microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification	microCLIP: unsupervised CLIP adaptation via coarse-fine token fusion, improving fine-grained image classification	large language model zero-shot transfer	✅
5	Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework	Uses GPT-4o and the SLSO framework to automatically generate findings for jaw cysts in dental panoramic radiographs	multimodal chain-of-thought
6	Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs	Proposes Patch-as-Decodable-Token (PaDT) for unified handling of multimodal vision tasks in MLLMs	large language model multimodal	✅
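7	Growing Visual Generative Capacity for Pre-Trained MLLMs	Proposes Bridge, a pure autoregressive unified multimodal large language model built on a Mixture-of-Transformers architecture, to improve visual generation	large language model multimodal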
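8	How Confident are Video Models? Empowering Video Models to Express their Uncertainty	Proposes a framework for quantifying the uncertainty of video models	large language model
9	VideoNSA: Native Sparse Attention Scales Video Understanding	Proposes VideoNSA, which uses native sparse attention to scale the context length of video understanding models	multimodal
10	VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL	VidGuard-R1: AI-generated video detection and explanation using reasoning MLLMs and reinforcement learning	large language model
11	From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding	Proposes F2C, which improves long-form video understanding through efficient key clip selection	large language model
12	FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation	Proposes the FRIEREN framework, which uses vision-language regularization for federated semantic segmentation, addressing domain generalization with unlabeled data	foundation model
13	OpusAnimation: Code-Based Dynamic Chart Generation	Proposes the DCG-Bench benchmark and the Qwen2.5-VL-DCG-3B model for dynamic chart generation	large language model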

🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
14	VLA-R1: Enhancing Reasoning in Vision-Language-Action Models	Proposes VLA-R1 to address insufficient reasoning in vision-language-action models	reinforcement learning reward design affordance	✅
15	GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation	GeoPurify achieves data-efficient open-vocabulary 3D segmentation through geometric distillation	distillation open-vocabulary open vocabulary	✅
16	RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning	Proposes RewardMap, which tackles sparse rewards in fine-grained visual reasoning through multi-stage reinforcement learning	reinforcement learning reward design large language model
17	MultiModal Action Conditioned Video Generation	Proposes a multimodal action-conditioned video generation model that improves simulation accuracy for fine-grained robot manipulation	world model multimodal
18	DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing	DragFlow: unleashes DiT priors with region-based supervision for superior drag editing	flow matching large language model multimodal
19	Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction	Proposes FMU, a flow-matching-guided deep unfolding network for hyperspectral image reconstruction	flow matching HSI	✅
20	Towards Better Optimization For Listwise Preference in Diffusion Models	Proposes Diffusion-LPO for listwise preference optimization in diffusion models, improving image quality and preference alignment	reinforcement learning RLHF DPO
21	Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery	Proposes Discrete Facial Encoding (DFE) for data-driven facial display discovery, as an alternative to FACS	representation learning masked autoencoder VQ-VAE
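22	Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback	Proposes the Oracle-RLAIF framework, which improves multimodal video models through reinforcement learning from ranking feedback	reinforcement learning
23	Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning	Proposes a rollout-guided adaptive pixel-space reasoning framework that improves VLM efficiency and accuracy on fine-grained visual tasks	reinforcement learning multimodal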

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
24	LOBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction	LoBE-GS: load-balanced and efficient 3D Gaussian Splatting for large-scale scene reconstruction	3D gaussian splatting 3DGS gaussian splatting
25	StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions	StealthAttack: proposes a density-guided stealthy poisoning attack on 3D Gaussian Splatting	3D gaussian splatting 3DGS gaussian splatting	✅
26	4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing	Proposes 4DGS-Craft to address consistency in 4D Gaussian Splatting editing	gaussian splatting splatting VGGT
27	Visual Odometry with Transformers	Proposes VoT, a Transformer-based visual odometry model for end-to-end monocular pose regression	visual odometry feature matching foundation model
28	GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing	GaussianMorphing: proposes a mesh-guided 3D Gaussian method for semantic-aware object morphing	3D gaussian splatting 3DGS gaussian splatting	✅
29	Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale	Proposes Con-NRSfM, which addresses non-rigid structure-from-motion using differential geometry with recoverable conformal scale	depth estimation

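🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
30	Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions	Proposes a detection framework that learns object sounds from real-world interactions, addressing the association between sounds and objects	egocentric multimodal
31	Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig	Proposes a mobile multi-camera rig for ego-exo 3D hand tracking in the wild	egocentric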

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
32	PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction	PhysHMR: learns humanoid control policies from vision for physically plausible human motion reconstruction	humanoid humanoid control reinforcement learning

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
33	Learning to Generate Rigid Body Interactions with Video Diffusion Models	KineMask: uses video diffusion models to generate videos with rigid-body interactions	physically plausible	✅
