cs.CV (2025-10-03)

📊 30 papers in total | 🔗 5 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (12 🔗3) · Pillar 3: Perception & Semantics (5) · Pillar 2: RL & Architecture (5) · Pillar 1: Robot Control (5 🔗1) · Pillar 4: Generative Motion (1 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

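| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 1 | GAS-MIL: Group-Aggregative Selection Multi-Instance Learning for Ensemble of Foundation Models in Digital Pathology Image Analysis | Proposes the GAS-MIL framework for ensembling multiple pretrained models in digital pathology image analysis. | foundation model, multimodal | |
| 2 | Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights | Uses large vision-language models for multimodal carotid risk stratification. | multimodal | |
| 3 | ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment | ELMF4EggQ: ensemble learning with multimodal feature fusion for non-destructive egg quality assessment. | multimodal | |
| 4 | Align Your Query: Representation Alignment for Multimodality Medical Object Detection | Proposes a multimodal context attention mechanism to address representation alignment in medical object detection. | multimodal | |
| 5 | TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency | Proposes TIT-Score, which evaluates text-to-image alignment for long prompts via text-to-image-to-text consistency. | large language model, multimodal | |
| 6 | Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention | HoloV: a visual token pruning framework that improves the efficiency of multimodal large language models by retaining holistic context. | large language model, multimodal | |
| 7 | AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding | Proposes AdaRD-Key for query-driven adaptive keyframe sampling in long videos, improving video understanding performance. | large language model, multimodal | |
| 8 | Domain Generalization for Semantic Segmentation: A Survey | A survey of domain generalization for semantic segmentation: analyzes methods and performance and highlights the impact of foundation models. | foundation model | |
| 9 | Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning | Spatial-ViLT enhances visual spatial reasoning through multi-task learning. | multimodal | |
| 10 | Visual Language Model as a Judge for Object Detection in Industrial Diagrams | Proposes a vision-language-model-based framework for assessing the quality of object detection in industrial diagrams. | multimodal | |
| 11 | Towards Scalable and Consistent 3D Editing | Proposes 3DEditFormer for scalable and consistent 3D editing and builds the large-scale dataset 3DEditVerse. | foundation model | |
| 12 | Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models | Uses explainability analysis to reveal the cognitive limits of vision-language models in riddle reasoning. | multimodal | |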

🔬 Pillar 3: Perception & Semantics (5 papers)

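| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 13 | From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting | Proposes a semantic-guided motion control method for dynamic 3D Gaussian splatting, addressing control-point allocation in dynamic reconstruction from monocular video. | 3D gaussian splatting, gaussian splatting, splatting | |
| 14 | Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes | Efficiently fine-tunes multimodal LLMs to tackle object detection in low-data regimes. | scene understanding, large language model | |
| 15 | ROGR: Relightable 3D Objects using Generative Relighting | ROGR: reconstructs relightable 3D object models using generative relighting. | NeRF, neural radiance field | |
| 16 | FSFSplatter: Build Surface and Novel Views with Sparse-Views within 2min | FSFSplatter: a fast surface reconstruction method that builds a scene from sparse views within 2 minutes. | gaussian splatting, splatting | |
| 17 | Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles | Proposes a training-free defense against adversarial attacks based on stochastic resonance of latent ensembles, applicable to multiple tasks. | optical flow | |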

🔬 Pillar 2: RL & Architecture (5 papers)

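| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 18 | LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models | LEAML: label-efficient adaptation of multimodal large language models to out-of-distribution visual tasks. | distillation, large language model, multimodal | |
| 19 | Training-Free Out-Of-Distribution Segmentation With Foundation Models | Proposes a training-free anomaly segmentation method that uses pretrained models for out-of-distribution detection. | representation learning, foundation model | |
| 20 | Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval | Proposes Retrv-R1, a reasoning-driven multimodal large language model framework for universal and efficient multimodal retrieval. | reinforcement learning, multimodal | |
| 21 | PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology | PEaRL: predicts gene and pathway expression from histology images via pathway-enhanced representation learning. | representation learning, contrastive learning, multimodal | |
| 22 | Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models | Smart-GRPO: optimizes noise sampling to improve the efficiency of reinforcement learning for flow-matching models. | reinforcement learning, flow matching | |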

🔬 Pillar 1: Robot Control (5 papers)

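| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 23 | Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields | Studies the role of geometric information in semantic distillation of neural radiance fields and proposes the SPINE framework for radiance field inversion without an initial guess. | manipulation, distillation, gaussian splatting | |
| 24 | SketchPlan: Diffusion Based Drone Planning From Human Sketches | SketchPlan: diffusion-based drone planning that generates flight paths from human sketches. | sim-to-real, 3D gaussian splatting, gaussian splatting | |
| 25 | Mask2IV: Interaction-Centric Video Generation via Mask Trajectories | Mask2IV: interaction-centric video generation via mask trajectories, without dense mask annotations. | manipulation, affordance, human-object interaction | |
| 26 | Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime! | Proposes DragStream for streaming, drag-based interactive video editing, supporting fine-grained control of any object at any time. | manipulation | |
| 27 | MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding | Proposes MaskCD, which mitigates LVLM hallucinations through image-head-masked contrastive decoding. | manipulation, multimodal | |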

🔬 Pillar 4: Generative Motion (1 paper)

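| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 28 | MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context | MoGIC: boosts motion generation through intention understanding and visual context. | text-driven motion, motion synthesis, motion generation | |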

🔬 Pillar 8: Physics-based Animation (1 paper)

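| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 29 | ReeMark: Reeb Graphs for Simulating Patterns of Life in Spatiotemporal Trajectories | Proposes ReeMark, which uses Reeb graphs to simulate patterns of life in spatiotemporal trajectories, for applications such as urban planning. | spatiotemporal | |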

🔬 Pillar 7: Motion Retargeting (1 paper)

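| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 30 | GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion | GeoComplete: a geometry-aware diffusion model for reference-driven image completion that significantly improves geometric consistency. | geometric consistency | |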
