cs.CV (2026-03-04)

📊 40 papers total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (13 🔗3) · Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 3: Perception & Semantics (8 🔗3) · Pillar 1: Robot Control (3) · Pillar 7: Motion Retargeting (2) · Pillar 4: Generative Motion (2 🔗1)

🔬 Pillar 2: RL & Architecture (13 papers)

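# | Title | One-sentence summary | Tags | 🔗
1 | EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR | EgoPoseFormer v2: accurate egocentric human motion estimation for AR/VR. | teacher-student, distillation, egocentric
2 | PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation | PROSPECT: unified streaming vision-language navigation via semantic-spatial fusion and latent predictive representation. | predictive model, representation learning, vision-language-action
3 | Scaling Dense Event-Stream Pretraining from Visual Foundation Models | Proposes an event-stream pretraining method based on visual foundation models that addresses semantic collapse in event representations. | distillation, foundation model
4 | Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection | Proposes CMDR-IAD for multimodal industrial anomaly detection. | teacher-student, multimodal
5 | From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning | Proposes the AVAR framework, which addresses attention allocation during the cold-start stage of multimodal large models and significantly improves reasoning performance. | reward shaping, multimodal
6 | CoRe-BT: A Multimodal Radiology-Pathology-Text Benchmark for Robust Brain Tumor Typing | CoRe-BT: a multimodal radiology-pathology-text benchmark dataset for robust brain tumor typing. | representation learning, multimodal
7 | Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning | Proposes a dual-criterion curation framework based on gaze stability and pupil novelty for efficient egocentric learning. | imitation learning, egocentric
8 | Discriminative Perception via Anchored Description for Reasoning Segmentation | Proposes DPAD, which achieves discriminative perception via anchored description and improves reasoning segmentation performance. | reinforcement learning, large language model, multimodal
9 | Separators in Enhancing Autoregressive Pretraining for Vision Mamba | Proposes STAR, which uses separators to enhance autoregressive pretraining of Vision Mamba and improve long-sequence handling. | Mamba, state space model
10 | TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning | TaxonRL: interpretable fine-grained visual reasoning via reinforcement learning with intermediate rewards. | reinforcement learning
11 | DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers | DiverseDiT: improves image synthesis quality through diverse representation learning in diffusion Transformers. | representation learning
12 | UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization | UniRain: a unified image deraining framework built on RAG-based dataset distillation and multi-objective reweighted optimization. | distillation
13 | Vector-Quantized Soft Label Compression for Dataset Distillation | Proposes a soft-label compression method based on a vector-quantized autoencoder to accelerate dataset distillation and reduce storage overhead. | distillation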

🔬 Pillar 9: Embodied Foundation Models (12 papers)

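# | Title | One-sentence summary | Tags | 🔗
14 | Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe | Proposes a one-shot probe method for predicting the zero-shot accuracy of VLFMs on underrepresented domains. | large language model, foundation model
15 | Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions | Proposes an image-based prompt injection attack that hijacks multimodal large language models with visually embedded adversarial instructions. | large language model, multimodal
16 | PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters | Proposes PlaneCycle, which transfers 2D pretrained models to 3D tasks without training or adapters. | foundation model
17 | Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers | For small-patch pathology images, task-specific CNNs outperform pretrained models, and data scale is the key factor. | foundation model
18 | ProFound: A moderate-sized vision foundation model for multi-task prostate imaging | ProFound: a moderate-sized vision foundation model for multi-task prostate imaging. | foundation model
19 | Towards Generalized Multimodal Homography Estimation | Proposes a generalized multimodal homography estimation method that improves cross-modal generalization. | multimodal
20 | Universal Pansharpening Foundation Model | Proposes FoundPS, a universal pansharpening foundation model that achieves satellite-agnostic and scene-robust image fusion. | foundation model
21 | RIVER: A Real-Time Interaction Benchmark for Video LLMs | Proposes the RIVER benchmark for evaluating video large language models in real-time interaction scenarios. | large language model, multimodal
22 | EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs | EvoPrune: early-stage visual token pruning for efficient multimodal large language models. | large language model, multimodal
23 | Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection | Pointer-CAD: unifies B-Rep and command sequences via pointer-based edge/face selection, improving CAD generation quality. | large language model
24 | SSR: A Generic Framework for Text-Aided Map Compression for Localization | Proposes the SSR framework, which uses text-aided map compression to improve localization efficiency and reduce storage cost. | large language model
25 | RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation | Proposes RAGTrack, which uses a retrieval-augmented generation framework to address target modeling and modality fusion in RGBT tracking. | large language model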

🔬 Pillar 3: Perception & Semantics (8 papers)

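# | Title | One-sentence summary | Tags | 🔗
26 | EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding | Proposes EmbodiedSplat, a feed-forward semantic 3DGS method for online open-vocabulary 3D scene understanding. | 3DGS, scene understanding, open-vocabulary
27 | DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping | DISC: a dense integrated semantic context method for large-scale open-set semantic mapping. | semantic mapping, semantic map
28 | Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Crab$^{+}$: a scalable, unified audio-visual scene understanding model built on explicit cooperation. | scene understanding, large language model, multimodal
29 | Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning | Proposes a structure-aware prompt adaptation method that improves generalization in open-vocabulary compositional zero-shot learning. | open-vocabulary, open vocabulary
30 | Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding | Proposes a knowledge-augmented fine-grained reasoning agent (KFRA) for open-set fine-grained visual understanding. | open-vocabulary, open vocabulary, multimodal
31 | Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements | Yolo-Key-6D: single-stage monocular 6D pose estimation with keypoint enhancements. | 6D pose estimation
32 | Glass Segmentation with Fusion of Learned and General Visual Features | Proposes a glass segmentation network that fuses learned features with general visual features, improving recognition accuracy for transparent objects. | scene understanding, foundation model
33 | ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training | ZipMap: linear-time, stateful 3D reconstruction with test-time training. | VGGT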

🔬 Pillar 1: Robot Control (3 papers)

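# | Title | One-sentence summary | Tags | 🔗
34 | ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors | ArtHOI: synthesizes articulated human-object interactions via 4D reconstruction from video priors. | manipulation, optical flow, physically plausible
35 | Motion Manipulation via Unsupervised Keypoint Positioning in Face Animation | Proposes MMFA, which achieves controllable face animation through unsupervised keypoint positioning. | manipulation, representation learning
36 | InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models | Proposes InEdit-Bench for evaluating how well image editing models reason over intermediate logical pathways. | manipulation, multimodal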

🔬 Pillar 7: Motion Retargeting (2 papers)

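# | Title | One-sentence summary | Tags | 🔗
37 | SimpliHuMoN: Simplifying Human Motion Prediction | SimpliHuMoN: proposes a simplified Transformer model for human motion prediction that achieves multi-task SOTA. | human motion, human motion prediction, motion prediction
38 | InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions | InfinityStory proposes a background-consistent, character-aware long-video generation framework that achieves hour-scale narrative video synthesis. | spatial relationship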

🔬 Pillar 4: Generative Motion (2 papers)

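# | Title | One-sentence summary | Tags | 🔗
39 | NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction | NOVA3R: a visual Transformer for non-pixel-aligned amodal 3D reconstruction. | physically plausible
40 | Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance | Proposes a diffusion sampling method based on embedded Runge-Kutta guidance that uses solver error to improve image generation quality. | classifier-free guidance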
