cs.CV(2025-10-02)

📊 33 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 2: RL & Architecture (10 🔗3) · Pillar 3: Perception & Semantics (6 🔗2) · Pillar 6: Video Extraction (2) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

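# | Title | One-sentence summary | Tags | 🔗
1 | Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations | Guides multimodal large language models with visual questions from blind and low-vision people to enable proactive visual interpretation | large language model, multimodal |
2 | Inferring Dynamic Physical Properties from Video Foundation Models | Infers dynamic physical properties from video using video foundation models | large language model, foundation model |
3 | ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models | Proposes ImageNet-Think-250K to improve the multimodal reasoning ability of vision-language models | multimodal |
4 | microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification | microCLIP: unsupervised CLIP fine-tuning via coarse-fine token fusion, improving fine-grained image classification | large language model, zero-shot transfer |
5 | Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework | Uses GPT-4o and the SLSO framework to automatically generate diagnostic findings for jaw cysts in dental panoramic radiographs | multimodal, chain-of-thought |
6 | Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs | Proposes Patch-as-Decodable-Token (PaDT) for unified handling of multimodal vision tasks in MLLMs | large language model, multimodal |
7 | Growing Visual Generative Capacity for Pre-Trained MLLMs | Proposes Bridge, a purely autoregressive unified multimodal large language model built on a hybrid Transformer architecture, improving visual generation capability | large language model, multimodal |
8 | How Confident are Video Models? Empowering Video Models to Express their Uncertainty | Proposes a framework for quantifying the uncertainty of video models | large language model |
9 | VideoNSA: Native Sparse Attention Scales Video Understanding | Proposes VideoNSA, which uses native sparse attention to extend the context length of video understanding models | multimodal |
10 | VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL | VidGuard-R1: detects and explains AI-generated video using reasoning MLLMs and reinforcement learning | large language model |
11 | From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding | Proposes F2C, which improves long-form video understanding through efficient key clip selection | large language model |
12 | FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation | Proposes the FRIEREN framework, which applies vision-language regularization to federated semantic segmentation, addressing domain generalization with unlabeled data | foundation model |
13 | OpusAnimation: Code-Based Dynamic Chart Generation | Proposes the DCG-Bench benchmark and the Qwen2.5-VL-DCG-3B model for the dynamic chart generation task | large language model |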

🔬 Pillar 2: RL & Architecture (10 papers)

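# | Title | One-sentence summary | Tags | 🔗
14 | VLA-R1: Enhancing Reasoning in Vision-Language-Action Models | Proposes VLA-R1 to address insufficient reasoning in vision-language-action models | reinforcement learning, reward design, affordance |
15 | GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation | GeoPurify: data-efficient open-vocabulary 3D segmentation through geometric distillation | distillation, open-vocabulary, open vocabulary |
16 | RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning | Proposes RewardMap, which tackles sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning | reinforcement learning, reward design, large language model |
17 | MultiModal Action Conditioned Video Generation | Proposes a multimodal action-conditioned video generation model that improves simulation accuracy for fine-grained robot manipulation | world model, multimodal |
18 | DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing | DragFlow: unleashes DiT priors through region-based supervision for drag editing | flow matching, large language model, multimodal |
19 | Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction | Proposes FMU, a flow-matching-guided deep unfolding network for hyperspectral image reconstruction | flow matching, HSI |
20 | Towards Better Optimization For Listwise Preference in Diffusion Models | Proposes Diffusion-LPO for listwise preference optimization in diffusion models, improving image quality and preference alignment | reinforcement learning, RLHF, DPO |
21 | Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery | Proposes Discrete Facial Encoding (DFE) for data-driven facial display discovery, as an alternative to FACS | representation learning, masked autoencoder, VQ-VAE |
22 | Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback | Proposes the Oracle-RLAIF framework, which improves multimodal video models through reinforcement learning from ranking feedback | reinforcement learning |
23 | Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning | Proposes a rollout-guided adaptive pixel-space reasoning framework that improves the efficiency and accuracy of VLMs on fine-grained visual tasks | reinforcement learning, multimodal |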

🔬 Pillar 3: Perception & Semantics (6 papers)

# | Title | One-sentence summary | Tags | 🔗
24 | LoBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction | LoBE-GS: load-balanced, efficient 3D Gaussian splatting for large-scale scene reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting |
25 | StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions | StealthAttack: a density-guided stealthy poisoning attack on 3D Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting |
26 | 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing | Proposes 4DGS-Craft to address consistency in 4D Gaussian splatting editing | gaussian splatting, splatting, VGGT |
27 | Visual Odometry with Transformers | Proposes VoT, a Transformer-based visual odometry method for end-to-end monocular pose regression | visual odometry, feature matching, foundation model |
28 | GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing | GaussianMorphing: a mesh-guided 3D Gaussian method for semantic-aware object morphing | 3D gaussian splatting, 3DGS, gaussian splatting |
29 | Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale | Proposes Con-NRSfM, which solves non-rigid structure-from-motion via differential geometry with recoverable conformal scale | depth estimation |

🔬 Pillar 6: Video Extraction (2 papers)

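# | Title | One-sentence summary | Tags | 🔗
30 | Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions | Proposes a detection framework that learns object sounds from real-world interactions, addressing the association between sounds and objects | egocentric, multimodal |
31 | Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig | Proposes a mobile multi-camera rig for ego-exo 3D hand tracking in real-world scenes | egocentric |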

🔬 Pillar 1: Robot Control (1 paper)

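# | Title | One-sentence summary | Tags | 🔗
32 | PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction | PhysHMR: learns humanoid control policies from vision for physically plausible human motion reconstruction | humanoid, humanoid control, reinforcement learning |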

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-sentence summary | Tags | 🔗
33 | Learning to Generate Rigid Body Interactions with Video Diffusion Models | KineMask: generates videos with rigid-body interactions using video diffusion models | physically plausible |
