cs.CV(2025-09-18)

📊 34 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗4) · Pillar 2: RL & Architecture (9 🔗1) · Pillar 3: Perception & Semantics (7 🔗1) · Pillar 1: Robot Control (3) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

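| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Chain-of-Thought Re-ranking for Image Retrieval Tasks | Proposes CoTRR, a chain-of-thought re-ranking method that improves multimodal LLM performance on image retrieval tasks. | large language model, multimodal, chain-of-thought | |
| 2 | Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning | Proposes navigation-aware pruning (NAP), which improves vision-and-language navigation efficiency through tuning-free multimodal token pruning. | VLN, large language model, multimodal | |
| 3 | How Good are Foundation Models in Step-by-Step Embodied Reasoning? | Proposes the FoMER benchmark for evaluating the step-by-step reasoning of foundation models in embodied environments. | foundation model, multimodal | |
| 4 | Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding | Uses multimodal LLMs for zero-shot spatio-temporal video grounding, proposing the DSTH and TAS strategies. | large language model, multimodal | |
| 5 | From Pixels to Urban Policy-Intelligence: Recovering Legacy Effects of Redlining with a Multimodal LLM | Uses a multimodal LLM to go from pixels to urban policy intelligence, recovering the legacy effects of redlining. | large language model, multimodal | |
| 6 | Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation | Presents web toolkits for acquiring multimodal piano performance datasets and annotating fingering. | multimodal | |
| 7 | Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications | Studies the trade-off between generalization and over-specialization when CLIP is fine-tuned for biometric tasks. | foundation model | |
| 8 | Attention Lattice Adapter: Visual Explanation Generation for Visual Foundation Model | Proposes the Attention Lattice Adapter (ALA) and Alternating Epoch Architect (AEA) for generating visual explanations of visual foundation models. | foundation model | |
| 9 | V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling | V-SenseDrive: a privacy-preserving framework that fuses road video with in-vehicle sensor data for road safety and driver behaviour modelling. | multimodal | |
| 10 | ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models | Proposes ORCA, a framework that uses agentic reasoning to improve hallucination mitigation and adversarial robustness in vision-language models. | multimodal | |
| 11 | ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data | ScaleCUA: scales open-source computer use agents with cross-platform data. | foundation model | |
| 12 | QuizRank: Picking Images by Quizzing VLMs | QuizRank: ranks images by quizzing vision-language models, improving the quality of images chosen for Wikipedia articles. | large language model | |
| 13 | Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification | Proposes the Cross-Modal Geometric Rectification (CMGR) framework to address geometric misalignment and texture bias in 3D few-shot class-incremental learning. | foundation model | |
| 14 | DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images | DACoN: automatic colorization of anime line art using DINO and any number of reference images. | foundation model | |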

🔬 Pillar 2: RL & Architecture (9 papers)

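| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | FMGS-Avatar: Mesh-Guided 2D Gaussian Splatting with Foundation Model Priors for 3D Monocular Avatar Reconstruction | FMGS-Avatar: mesh-guided 2D Gaussian splatting with foundation model priors for monocular 3D avatar reconstruction. | distillation, 3D gaussian splatting, gaussian splatting | |
| 16 | Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation | Proposes a cross-modal distillation method for monocular depth estimation with event cameras. | distillation, depth estimation, monocular depth | |
| 17 | Efficient Multimodal Dataset Distillation via Generative Models | Proposes EDGE to make multimodal dataset distillation more efficient. | distillation, large language model, multimodal | |
| 18 | Comparing Computational Pathology Foundation Models using Representational Similarity Analysis | Uses representational similarity analysis to compare several pretrained computational pathology models, revealing differences in their representational structure. | contrastive learning, distillation, foundation model | |
| 19 | Self-supervised learning of imaging and clinical signatures using a multimodal joint-embedding predictive architecture | Uses self-supervised learning with a multimodal joint-embedding predictive architecture to improve lung nodule diagnosis. | predictive model, multimodal | |
| 20 | NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training | NeuroRAD-FM: a foundation model for neuro-oncology trained with distributionally robust optimization. | MAE, foundation model | |
| 21 | Beyond Random Masking: A Dual-Stream Approach for Rotation-Invariant Point Cloud Masked Autoencoders | Proposes a dual-stream masked autoencoder that improves rotation-invariant representation learning for point clouds. | masked autoencoder, MAE, curriculum learning | |
| 22 | Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception | Proposes AdaptiveNN, which emulates human adaptive vision to achieve efficient and flexible machine visual perception. | reinforcement learning, representation learning, embodied AI | |
| 23 | Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks | Analyzes the representation power of self-supervised ViTs in downstream tasks and investigates the best feature selection strategy. | contrastive learning, distillation | |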

🔬 Pillar 3: Perception & Semantics (7 papers)

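| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation | VocAlign: a vocabulary alignment method for source-free domain adaptation in open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary | |
| 25 | URNet: Uncertainty-aware Refinement Network for Event-based Stereo Depth Estimation | URNet: an uncertainty-aware refinement network for stereo depth estimation with event cameras. | depth estimation, stereo depth | |
| 26 | NeRF-based Visualization of 3D Cues Supporting Data-Driven Spacecraft Pose Estimation | Proposes a NeRF-based method for visualizing 3D cues, used to understand data-driven spacecraft pose estimation. | NeRF, implicit representation | |
| 27 | UCorr: Wire Detection and Depth Estimation for Autonomous Drones | Proposes UCorr for detecting thin objects such as wires and estimating their depth for autonomous drones. | depth estimation | |
| 28 | RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes | Proposes ROS-Cam, which efficiently optimizes camera parameters in dynamic scenes using only RGB video. | metric depth, NeRF | |
| 29 | Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model | Proposes an efficient, lightweight multi-view stereo method based on a confidence-aware diffusion model. | depth estimation | |
| 30 | SPATIALGEN: Layout-guided 3D Indoor Scene Generation | SpatialGen: a layout-guided model for 3D indoor scene generation. | scene understanding | |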

🔬 Pillar 1: Robot Control (3 papers)

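| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation | Proposes VLA-LPAF, a lightweight perspective-adaptive fusion module that improves the generalization of VLA models in robotic manipulation. | manipulation, vision-language-action, VLA | |
| 32 | RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation | RynnVLA-001: uses human demonstrations to improve robot manipulation, proposing a VLA model with two-stage pretraining. | manipulation, vision-language-action, VLA | |
| 33 | Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies | Uses YOLOv11 and domain randomization strategies for synthetic-to-real object detection. | domain randomization | |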

🔬 Pillar 7: Motion Retargeting (1 paper)

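| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 34 | SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters | SmolRGPT: an efficient 600M-parameter vision-language model for spatial reasoning in warehouse environments. | spatial relationship, multimodal | |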
