cs.CV(2026-03-06)

📊 59 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (20 🔗5) · Pillar 3: Perception & Semantics (18 🔗2) · Pillar 9: Embodied Foundation Models (17 🔗5) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction (1) · Pillar 7: Motion Retargeting (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 2: RL & Architecture (20 papers)

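| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 1 | Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion | Place-it-R1: environment-aware video object insertion using multimodal large language models | DPO, direct preference optimization, scene understanding |
| 2 | CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection | Proposes CR-QAT to address the distortion of vision-language alignment and relational structure in low-bit quantization of OVOD | distillation, open-vocabulary, open vocabulary |
| 3 | CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning | Proposes CORE-Seg, which tackles complex lesion segmentation through reasoning-based segmentation driven by reinforcement learning | reinforcement learning, large language model, multimodal |
| 4 | Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models | Curious-VLA: improves the performance of autonomous-driving VLA models through enhanced exploration | reinforcement learning, imitation learning, VLA |
| 5 | EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking | EgoReasoner: learns egocentric 4D reasoning through task-adaptive structured thinking | reinforcement learning, egocentric, chain-of-thought |
| 6 | Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement | Extracts geometric information for continuous physical measurement from frozen pretrained-model features using linear probes | MAE, foundation model |
| 7 | MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis | Proposes MoEMambaMIL, structure-aware selective state space modeling for WSI analysis | Mamba, SSM, state space model |
| 8 | Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models | Proposes GvU: improves the generation ability of unified multimodal models through an understanding-driven intrinsic reward mechanism | reinforcement learning, multimodal |
| 9 | PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues | PatchCue: enhances vision-language model reasoning with patch-based visual cues | reinforcement learning, multimodal, chain-of-thought |
| 10 | Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis | Proposes Self-Flow, a self-supervised flow matching method that improves the scalability and generation quality of multi-modal synthesis | flow matching, representation learning |
| 11 | Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders | Penguin-VL: initializes the vision encoder from an LLM to explore the performance limits of efficient VLMs | contrastive learning, multimodal |
| 12 | Training Flow Matching: The Role of Weighting and Parameterization | Studies the training objectives of flow matching models and analyzes how weighting, parameterization, and related factors affect generation quality | flow matching |
| 13 | What if? Emulative Simulation with World Models for Situated Reasoning | Proposes the WanderDream dataset for simulated exploration with world models in situated reasoning | world model |
| 14 | LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents | LATO: proposes a topology-preserving latent representation that enables scalable flow-matching-based 3D mesh generation | flow matching |
| 15 | WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching | Proposes WorldCache to address the low inference efficiency of diffusion models | world model |
| 16 | Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention | Proposes spatially-sparse linear attention (SSLA) for low-latency event-camera object detection | linear attention |
| 17 | Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning | Proposes a two-stage contrastive-to-self-supervised framework for script similarity learning | contrastive learning, teacher-student, distillation |
| 18 | Cross-Resolution Distribution Matching for Diffusion Distillation | Proposes the RMD framework, which accelerates high-fidelity diffusion model distillation via cross-resolution distribution matching | distillation |
| 19 | Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models | Proposes S2I encoding, which uses vision-pretrained models for self-supervised skeleton representation learning | representation learning |
| 20 | TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation | Proposes TempoSyncDiff for low-latency, temporally stable audio-driven talking head generation | teacher-student, distillation |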

🔬 Pillar 3: Perception & Semantics (18 papers)

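| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 21 | Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting | Proposes a method for converting RGB-LiDAR data into 3D Gaussian Splatting to build large-scale digital twins efficiently | 3D gaussian splatting, 3DGS, gaussian splatting |
| 22 | EntON: Eigenentropy-Optimized Neighborhood Densification in 3D Gaussian Splatting | EntON: eigenentropy-optimized neighborhood densification for 3D Gaussian Splatting, improving geometric accuracy and rendering quality | 3D gaussian splatting, 3DGS, gaussian splatting |
| 23 | CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis | CylinderSplat: panoramic novel view synthesis using 3D Gaussian Splatting with cylindrical triplanes | 3D gaussian splatting, 3DGS, gaussian splatting |
| 24 | VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction | VG3S: semantic occupancy prediction using Gaussian Splatting with visual geometry priors | 3D gaussian splatting, gaussian splatting, splatting |
| 25 | FTSplat: Feed-forward Triangle Splatting Network | Proposes FTSplat, which achieves efficient 3D reconstruction with a feed-forward triangle splatting network | 3D gaussian splatting, 3DGS, gaussian splatting |
| 26 | NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving | NOVA: proposes an open-vocabulary autoregressive method for 3D multi-object tracking in autonomous driving | open-vocabulary, open vocabulary, large language model |
| 27 | JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas | JOPP-3D: a joint open-vocabulary semantic segmentation framework for point clouds and panoramas | scene understanding, open-vocabulary, open vocabulary |
| 28 | DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model | DeepSight: the first depth-driven multimodal model, bridging the gap between depth maps and language and improving 3D scene understanding | scene understanding, large language model, multimodal |
| 29 | Exploring Open-Vocabulary Object Recognition in Images using CLIP | Proposes a CLIP-based open-vocabulary object recognition framework that requires no complex training and generalizes well | open-vocabulary, open vocabulary |
| 30 | Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction | Proposes a spectral diagnostic framework that evaluates how feature upsampling methods contribute to 3D awareness in 2D-to-3D reconstruction | scene reconstruction, geometric consistency, foundation model |
| 31 | FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models | FreeOcc: training-free panoptic occupancy prediction using pretrained models | scene understanding, foundation model |
| 32 | EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition | EventGeM: a global-to-local feature matching method for event-camera visual place recognition | depth estimation, feature matching, foundation model |
| 33 | RePer-360: Releasing Perspective Priors for 360° Depth Estimation via Self-Modulation | RePer-360: releases perspective priors through self-modulation for 360° depth estimation | depth estimation, foundation model |
| 34 | AV-Unified: A Unified Framework for Audio-visual Scene Understanding | Proposes AV-Unified, a unified framework for audio-visual scene understanding with multi-task joint learning | scene understanding, spatiotemporal |
| 35 | CHMv2: Improvements in Global Canopy Height Mapping using DINOv3 | CHMv2: improves global canopy height mapping with DINOv3, increasing accuracy and detail | depth estimation, height map |
| 36 | PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction | PixARMesh: proposes an autoregressive, mesh-native method for single-view scene reconstruction | scene reconstruction |
| 37 | Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image | Pano3DComposer: feed-forward compositional 3D scene generation from a single panoramic image | VGGT, geometric consistency |
| 38 | Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation | Rewis3d: uses 3D reconstruction to improve weakly-supervised semantic segmentation | scene reconstruction |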

🔬 Pillar 9: Embodied Foundation Models (17 papers)

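| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 39 | Multimodal Large Language Models as Image Classifiers | Improves the performance of multimodal large language models on image classification by correcting evaluation protocols and annotations | large language model, multimodal |
| 40 | TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis | TumorChain: interleaved multimodal chain-of-thought reasoning for traceable clinical tumor analysis | multimodal, chain-of-thought |
| 41 | Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | Omni-Diffusion: a unified framework for multimodal understanding and generation based on masked discrete diffusion | large language model, foundation model, multimodal |
| 42 | Lyapunov Probes for Hallucination Detection in Large Foundation Models | Proposes Lyapunov Probes, which detect hallucinations in large models using dynamical-systems stability theory | large language model, foundation model, multimodal |
| 43 | MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs | Proposes the MM-ISTS framework, which uses multimodal LLMs cooperatively for irregularly sampled time series forecasting | large language model, multimodal |
| 44 | Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving | Proposes redundancy modeling to improve the quality of multisource multimodal data for autonomous driving | multimodal |
| 45 | GreenRFM: Toward a resource-efficient radiology foundation model | Proposes GreenRFM, a resource-efficient radiology foundation model that outperforms existing models | foundation model |
| 46 | Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events | Proposes CoE, a training-free multimodal summarization framework based on Chain-of-Events that improves cross-modal fusion and temporal modeling | multimodal |
| 47 | SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection | Proposes SpaCRD, which fuses histology and spatial transcriptomics data for accurate cross-platform cancer region detection | multimodal |
| 48 | Longitudinal NSCLC Treatment Progression via Multimodal Generative Models | Proposes a dose-aware multimodal generative model for predicting tumor evolution during NSCLC radiotherapy | multimodal |
| 49 | FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography | FontUse: a data-centric approach to generating in-image typography with controllable style and use case | large language model, multimodal |
| 50 | EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation | EffectMaker: unifies reasoning and generation for customized visual effect creation | large language model, multimodal |
| 51 | CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization | Proposes CaTok, which achieves one-dimensional causal image tokenization with a MeanFlow decoder and improves image reconstruction quality | foundation model |
| 52 | HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models | HiPP-Prune: hierarchical preference-conditioned structured pruning for vision-language models | visual grounding |
| 53 | GazeMoE: Perception of Gaze Target with Mixture-of-Experts | Proposes GazeMoE to address human gaze target estimation | foundation model |
| 54 | Point-Supervised Skeleton-Based Human Action Segmentation | Proposes a point-supervised framework for skeleton-based action segmentation that reduces annotation cost and improves performance | multimodal |
| 55 | OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer | Proposes OVGGT, a constant-cost streaming visual geometry Transformer for 3D reconstruction from long videos | foundation model |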

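🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 56 | Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots | Proposes a Motion Turing Test framework for evaluating the human-likeness of humanoid robot motion, and builds the HHMotion dataset | humanoid, humanoid robot, motion generation |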

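🔬 Pillar 6: Video Extraction (1 paper)

| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 57 | Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching | Match4Annotate: propagates sparse video annotations via implicit neural feature matching, addressing annotation difficulties in medical imaging and similar domains | feature matching, spatiotemporal |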

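🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 58 | FlowMotion: Training-Free Flow Guidance for Video Motion Transfer | FlowMotion: video motion transfer via optical-flow guidance; training-free, efficient, and flexible | motion representation |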

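🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 59 | GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection | GenHOI: object-consistent hand-object interaction video generation via temporally balanced and spatially selective object injection | physically plausible, HOI |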
