cs.CV(2025-12-16)

📊 52 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗4)
Pillar 3: Perception & Semantics (11 🔗1)
Pillar 2: RL & Architecture (9)
Pillar 3: Perception & SLAM (7 🔗2)
Pillar 1: Robot Control (5)
Pillar 4: Generative Motion (3)
Pillar 7: Motion Retargeting (2)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

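# | Title | One-line summary | Tags
1 | HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices | HyperVL: an efficient, dynamic multimodal large language model for edge devices. | large language model, multimodal
2 | LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction | Proposes KEMM, a multimodal cancer survival prediction model that uses LLM-enhanced knowledge. | large language model, multimodal
3 | Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries | Proposes RUNE, which combines neurosymbolic inference with large models for remote sensing text-to-image retrieval with complex queries. | large language model, foundation model
4 | Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency | ReVision: a retinal native-intelligence model built on large-scale clinical practice that improves deployment efficiency. | foundation model
5 | SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition | Releases the SignIT Italian Sign Language dataset together with a multimodal sign language recognition benchmark analysis. | multimodal
6 | OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving | OmniGen: a unified multimodal sensor generation framework for data augmentation in autonomous driving scenes. | multimodal
7 | Real-time prediction of workplane illuminance distribution for daylight-linked controls using non-intrusive multimodal deep learning | Proposes a non-intrusive multimodal deep learning method for real-time prediction of workplane illuminance under daylight, for use in daylight-linked controls. | multimodal
8 | ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning | Proposes ChartAgent, a chart understanding framework with tool-integrated reasoning that improves robustness under sparse annotation. | large language model, multimodal
9 | KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding | Proposes the KFS-Bench benchmark for comprehensive evaluation of key frame sampling in long-video question answering. | large language model, multimodal
10 | ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking | Proposes the ViRC framework, which uses Reason Chunking to strengthen visual interleaved mathematical CoT reasoning. | multimodal
11 | FoodLogAthl-218: Constructing a Real-World Food Image Dataset Using Dietary Management Applications | FoodLogAthl-218: a real-world food image dataset built from dietary management applications. | multimodal
12 | DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning | Proposes DISCODE, a distribution-aware score decoder that improves the robustness of automatic image captioning evaluation. | multimodal
13 | Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs | Proposes EVPG, which enhances visual programming with probabilistic graphs to improve visual reasoning. | large language model
14 | TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models | Proposes the TorchTraceAP benchmark dataset for detecting performance anti-patterns in computer vision models. | large language model
15 | Selective, Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach | Proposes a training-free, data-free method for controlled, selective, domain-agnostic unlearning in CLIP. | multimodal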

🔬 Pillar 3: Perception & Semantics (11 papers)

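# | Title | One-line summary | Tags
16 | DASP: Self-supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors | DASP: self-supervised nighttime monocular depth estimation using domain adaptation of spatiotemporal priors. | depth estimation, monocular depth, spatiotemporal
17 | HGS: Hybrid Gaussian Splatting with Static-Dynamic Decomposition for Compact Dynamic View Synthesis | Proposes Hybrid Gaussian Splatting (HGS), which achieves compact dynamic view synthesis through static-dynamic decomposition. | 3D gaussian splatting, 3DGS, gaussian splatting
18 | GaussianPlant: Structure-aligned Gaussian Splatting for 3D Reconstruction of Plants | GaussianPlant: a structure-aligned Gaussian splatting method for 3D reconstruction of plants. | 3D gaussian splatting, 3DGS, gaussian splatting
19 | Beyond a Single Light: A Large-Scale Aerial Dataset for Urban Scene Reconstruction Under Varying Illumination | SkyLume: a large-scale, multi-illumination aerial dataset for urban reconstruction, targeting 3D reconstruction under varying illumination. | 3D gaussian splatting, gaussian splatting, splatting
20 | Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos | ExpanDyNeRF: broadens view synthesis of dynamic scenes, addressing rendering distortion at large viewing angles from monocular video. | gaussian splatting, splatting, NeRF
21 | Consistent Instance Field for Dynamic Scene Understanding | Proposes the Consistent Instance Field for spatiotemporally continuous probabilistic modeling in dynamic scene understanding. | scene understanding, open-vocabulary, open vocabulary
22 | Deep Learning Perspective of Scene Understanding in Autonomous Robots | Surveys deep learning applications in scene understanding for autonomous robots that improve robot perception and decision-making. | visual SLAM, depth estimation, scene understanding
23 | Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere | Proposes Spherical Voronoi for differentiable directional appearance modeling in 3D Gaussian splatting. | 3D gaussian splatting, gaussian splatting, splatting
24 | ASAP-Textured Gaussians: Enhancing Textured Gaussians with Adaptive Sampling and Anisotropic Parameterization | Proposes adaptive sampling and anisotropic parameterization to address the texture efficiency problem. | 3D gaussian splatting, gaussian splatting, splatting
25 | Robust Single-shot Structured Light 3D Imaging via Neural Feature Decoding | Proposes a robust single-shot structured light 3D imaging method based on neural feature decoding, improving depth estimation accuracy in complex scenes. | depth estimation, monocular depth, feature matching
26 | Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding | Elastic3D: a controllable stereo video conversion method based on guided latent decoding. | depth estimation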

🔬 Pillar 2: RL & Architecture (9 papers)

# | Title | One-line summary | Tags
27 | AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation | AnchorHOI: zero-shot 4D human-object interaction generation via anchor-based prior distillation. | distillation, NeRF, neural radiance field
28 | WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling | WorldPlay: a real-time interactive world modeling method with long-term geometric consistency. | world model, distillation, geometric consistency
29 | A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning | Proposes the A4-Agent framework for zero-shot affordance reasoning. | dreamer, affordance, embodied AI
30 | TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs | TimeLens: rethinks video temporal grounding with multimodal LLMs and builds a high-quality baseline. | reinforcement learning, large language model, multimodal
31 | Unified Semantic Transformer for 3D Scene Understanding | Proposes UNITE, a unified semantic Transformer model for 3D scene understanding. | distillation, scene understanding, open-vocabulary
32 | OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving | OmniDrive-R1: reinforcement-learning-driven interleaved multimodal CoT that improves the trustworthiness of vision-language models for autonomous driving. | reinforcement learning, visual grounding, chain-of-thought
33 | PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition | PSMamba: a progressive self-supervised vision Mamba framework for plant disease recognition. | Mamba, representation learning, teacher-student
34 | S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation | Proposes S2D, a sparse-to-dense keymask distillation method for unsupervised video instance segmentation. | distillation
35 | FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling | FacEDiT: unifies talking face editing and generation via facial motion infilling. | flow matching, masked autoencoder

🔬 Pillar 3: Perception & SLAM (7 papers)

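# | Title | One-line summary | Tags
36 | ACE-SLAM: Scene Coordinate Regression for Neural Implicit Real-Time SLAM | ACE-SLAM: a neural implicit real-time SLAM system based on scene coordinate regression. | SLAM, localization
37 | CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer | Proposes CLNet, which strengthens image-retrieval-based geo-localization through cross-view correspondence. | localization
38 | 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation | Proposes 4D-RaDiff, which uses a latent diffusion model to generate 4D radar point clouds and improve object detection performance. | point cloud
39 | History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation | Proposes a history-enhanced two-stage Transformer that balances global reasoning and local understanding in UAV vision-and-language navigation. | navigation
40 | FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation | FastDDHPose: a unified, efficient, and disentangled 3D human pose estimation method. | pose estimation
41 | TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning | Proposes the TUN network, which uses deep learning to automatically detect significant points in persistence diagrams. | point cloud
42 | Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in | Zoom-Zero: strengthens video understanding through temporal zoom-in, addressing inaccurate temporal localization in GVQA. | localization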

🔬 Pillar 1: Robot Control (5 papers)

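# | Title | One-line summary | Tags
43 | CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives | CRISP: a contact-guided Real2Sim method based on monocular video and planar scene primitives. | humanoid, humanoid control, real2sim
44 | DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos | DRAW2ACT: a depth-aware, trajectory-conditioned video generation framework for producing robot manipulation demonstration videos. | manipulation, embodied AI, multimodal
45 | Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity | Proposes a synergistic intermediate feature manipulation (SIFM) method that improves image immunity against malicious diffusion-model editing. | manipulation, large language model, multimodal
46 | Towards Transferable Defense Against Malicious Image Edits | Proposes the TDAE framework, which improves the transferability of image defenses against malicious edits. | manipulation
47 | Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure | Proposes Vector Prism, which animates vector graphics by stratifying semantic structure. | motion planning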

🔬 Pillar 4: Generative Motion (3 papers)

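# | Title | One-line summary | Tags
48 | Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models | Sparse-LaViDa: accelerates multimodal discrete diffusion language models through sparsified sampling. | MDM, multimodal
49 | ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body | ViBES: a conversational agent with a behaviorally intelligent 3D virtual body. | text-to-motion, motion generation, multimodal
50 | VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image | VASA-3D: lifelike audio-driven Gaussian head avatar generation from a single image. | motion latent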

🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags
51 | ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models | ViewMask-1-to-3: multi-view consistent image generation based on multimodal diffusion models. | geometric consistency, multimodal
52 | SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing | SketchAssist: a practical sketch assistant for semantic edits and precise local redrawing. | structure preservation
