cs.CV (2025-09-18)

📊 34 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗4) | Pillar 2: RL & Architecture (9 🔗1) | Pillar 3: Perception & Semantics (7 🔗1) | Pillar 1: Robot Control (3) | Pillar 7: Motion Retargeting (1 🔗1)

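🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Chain-of-Thought Re-ranking for Image Retrieval Tasks	Proposes CoTRR, a chain-of-thought re-ranking method that improves multimodal large language models on image retrieval tasks.	large language model multimodal chain-of-thought	✅
2	Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning	Proposes navigation-aware pruning (NAP), which improves vision-and-language navigation efficiency through tuning-free multimodal token pruning.	VLN large language model multimodal
3	How Good are Foundation Models in Step-by-Step Embodied Reasoning?	Proposes the FoMER benchmark for evaluating the step-by-step reasoning of foundation models in embodied environments.	foundation model multimodal
4	Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding	Uses multimodal LLMs for zero-shot spatio-temporal video grounding, proposing the DSTH and TAS strategies.	large language model multimodal	✅
5	From Pixels to Urban Policy-Intelligence: Recovering Legacy Effects of Redlining with a Multimodal LLM	Uses a multimodal LLM to go from pixels to urban policy intelligence, recovering the legacy effects of redlining.	large language model multimodal
6	Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation	Presents two web toolkits for acquiring multimodal piano performance datasets and annotating fingering.	multimodal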
7	Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications	Studies the trade-off between generalization and over-specialization when fine-tuning CLIP for biometric tasks.	foundation model
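8	Attention Lattice Adapter: Visual Explanation Generation for Visual Foundation Model	Proposes the Attention Lattice Adapter (ALA) and the Alternating Epoch Architect (AEA) for generating visual explanations of visual foundation models.	foundation model
9	V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling	V-SenseDrive: a privacy-preserving framework that fuses road video with in-vehicle sensor data for road safety and driver behaviour modelling.	multimodal
10	ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models	Proposes ORCA, a framework that uses agentic reasoning to reduce hallucination and improve adversarial robustness in vision-language models.	multimodal
11	ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data	ScaleCUA: scaling open-source computer use agents with cross-platform data.	foundation model	✅
12	QuizRank: Picking Images by Quizzing VLMs	QuizRank: ranks images by quizzing vision-language models, improving the quality of images chosen for Wikipedia articles.	large language model
13	Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification	Proposes the Cross-Modal Geometric Rectification (CMGR) framework to address geometric misalignment and texture bias in 3D few-shot class-incremental learning.	foundation model
14	DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images	DACoN: automatic colorization of anime line art using DINO and any number of reference images.	foundation model	✅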

🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
15	FMGS-Avatar: Mesh-Guided 2D Gaussian Splatting with Foundation Model Priors for 3D Monocular Avatar Reconstruction	FMGS-Avatar: mesh-guided 2D Gaussian splatting with foundation model priors for monocular 3D avatar reconstruction.	distillation 3D gaussian splatting gaussian splatting
16	Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation	Proposes a cross-modal distillation method for monocular depth estimation with event cameras.	distillation depth estimation monocular depth
17	Efficient Multimodal Dataset Distillation via Generative Models	Proposes EDGE to address the efficiency of multimodal dataset distillation.	distillation large language model multimodal
18	Comparing Computational Pathology Foundation Models using Representational Similarity Analysis	Uses representational similarity analysis to compare several pretrained models in computational pathology, revealing differences in their representational structure.	contrastive learning distillation foundation model
19	Self-supervised learning of imaging and clinical signatures using a multimodal joint-embedding predictive architecture	Uses self-supervised learning with a multimodal joint-embedding predictive architecture to improve lung nodule diagnosis.	predictive model multimodal
20	NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training	NeuroRAD-FM: a foundation model for neuro-oncology with distributionally robust training.	MAE foundation model
21	Beyond Random Masking: A Dual-Stream Approach for Rotation-Invariant Point Cloud Masked Autoencoders	Proposes a dual-stream masked autoencoder that improves rotation-invariant representation learning for point clouds.	masked autoencoder MAE curriculum learning
22	Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception	Proposes AdaptiveNN, which emulates human adaptive vision for efficient and flexible machine visual perception.	reinforcement learning representation learning embodied AI	✅
23	Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks	Analyzes the representation power of self-supervised ViTs in downstream tasks and explores the best feature selection strategy.	contrastive learning distillation

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation	VocAlign: a vocabulary alignment method for source-free domain adaptation in open-vocabulary semantic segmentation.	open-vocabulary open vocabulary
25	URNet: Uncertainty-aware Refinement Network for Event-based Stereo Depth Estimation	URNet: an uncertainty-aware refinement network for event-based stereo depth estimation.	depth estimation stereo depth
26	NeRF-based Visualization of 3D Cues Supporting Data-Driven Spacecraft Pose Estimation	Proposes a NeRF-based method for visualizing 3D cues, to help understand data-driven spacecraft pose estimation.	NeRF implicit representation
27	UCorr: Wire Detection and Depth Estimation for Autonomous Drones	Proposes UCorr for detecting thin objects such as wires and estimating their depth for autonomous drones.	depth estimation
28	RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes	Proposes ROS-Cam, which efficiently optimizes camera parameters in dynamic scenes from RGB video alone.	metric depth NeRF
29	Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model	Proposes an efficient, lightweight multi-view stereo method based on a confidence-aware diffusion model.	depth estimation	✅
30	SPATIALGEN: Layout-guided 3D Indoor Scene Generation	SpatialGen: a layout-guided 3D indoor scene generation model.	scene understanding

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
31	VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation	Proposes VLA-LPAF, a lightweight perspective-adaptive fusion module that improves the generalization of VLA models in robotic manipulation.	manipulation vision-language-action VLA
32	RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation	RynnVLA-001: improves robot manipulation using human demonstrations, proposing a VLA model with two-stage pretraining.	manipulation vision-language-action VLA
33	Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies	Uses YOLOv11 and domain randomization strategies for synthetic-to-real object detection.	domain randomization

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
34	SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters	SmolRGPT: a 600M-parameter vision-language model for efficient spatial reasoning in warehouse environments.	spatial relationship multimodal	✅
