cs.CV (2026-04-06)

📊 82 papers in total

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (36) · Pillar 3: Perception & Semantics (19) · Pillar 2: RL & Architecture (16) · Pillar 1: Robot Control (4) · Pillar 4: Generative Motion (2) · Pillar 8: Physics-based Animation (2) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (36 papers)

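| # | Title | One-line summary | Tags |
|---|---|---|---|
| 1 | QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models | QAPruner: quantization-aware vision token pruning for multimodal large language models | large language model, multimodal |
| 2 | ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization | Proposes ForgeryGPT for interpretable image forgery detection and localization, with support for interactive dialogue | large language model, multimodal, instruction following |
| 3 | Multimodal Language Models Cannot Spot Spatial Inconsistencies | Proposes a multi-view spatial consistency evaluation method that exposes the weaknesses of MLLMs in 3D reasoning | large language model, multimodal |
| 4 | Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs | Proposes Guideline2Graph, which parses clinical guidelines into executable clinical decision graphs and significantly improves parsing accuracy | multimodal |
| 5 | Token-Efficient Multimodal Reasoning via Image Prompt Packaging | Proposes image prompt packaging to reduce the cost of multimodal reasoning | multimodal |
| 6 | Rapidly deploying on-device eye tracking by distilling visual foundation models | DistillGaze: rapidly deployable on-device eye tracking by distilling vision foundation models | foundation model |
| 7 | Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery | Smart Transfer: uses a vision foundation model to rapidly map building damage from post-earthquake high-resolution imagery | foundation model |
| 8 | MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications | MOMO: a multi-sensor fusion Mars orbital model for Mars orbital applications | foundation model |
| 9 | MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling | MMPhysVideo: improves physical plausibility in video generation through joint multimodal modeling | multimodal |
| 10 | Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models | Proposes a visual explanation method for geometry education based on procedural geometry data generation and vision-language models | visual grounding |
| 11 | The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment | Proposes the contrastive fusion framework ConFu to capture complex dependencies in higher-order multimodal alignment | multimodal |
| 12 | EGM: Efficient Visual Grounding Language Models | Proposes EGM, which improves the efficiency of small vision-language models on visual grounding by generating more medium-quality tokens | visual grounding |
| 13 | MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models | MuRF: unlocks the multi-scale potential of vision foundation models to improve inference performance | foundation model |
| 14 | Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs | Efficient3D: a unified framework for adaptive and debiased token reduction in 3D MLLMs | large language model, multimodal |
| 15 | Token Warping Helps MLLMs Look from Nearby Viewpoints | Token Warping: improves the reasoning ability of multimodal large language models under viewpoint changes | large language model, multimodal |
| 16 | Progressive Video Condensation with MLLM Agent for Long-form Video Understanding | Proposes ProVCA, a progressive video condensation method based on an MLLM agent for long-form video understanding | large language model, multimodal |
| 17 | SentiAvatar: Towards Expressive and Interactive Digital Humans | SentiAvatar: a framework for building expressive and interactive digital humans | foundation model, multimodal |
| 18 | PolyReal: A Benchmark for Real-World Polymer Science Workflows | PolyReal: a multimodal large language model benchmark for real-world polymer science workflows | large language model, multimodal |
| 19 | MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs | MI-Pruner: a mutual-information-based cross-modal visual token pruning method that improves MLLM efficiency | large language model, multimodal |
| 20 | CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models | CoDA: explores chain-of-distribution attacks and post-hoc token-space repair for medical vision-language models | large language model, multimodal |
| 21 | Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation | Proposes Inertia-aware Visual Excitation to mitigate cognitive hallucination in multimodal large language models | large language model, multimodal |
| 22 | VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors | Vision-language models over-rely on semantic anchors and ignore visual detail, which limits their visual reasoning ability | multimodal |
| 23 | EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors | EnsemHalDet: robust VLM hallucination detection via an ensemble of internal-state detectors | multimodal |
| 24 | SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection | Proposes SPG, sparse-projected guides based on sparse autoencoders, for zero-shot anomaly detection | foundation model |
| 25 | CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation | Proposes CrossWeaver, a cross-modal fusion framework for arbitrary-modality semantic segmentation | multimodal |
| 26 | QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection | Proposes the QVAD framework to address the static-query problem in video anomaly detection | foundation model |
| 27 | A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification | Proposes a data-augmentation-based ViT baseline for SAR sea ice classification that improves recognition accuracy for rare ice classes | multimodal |
| 28 | Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models | Proposes UCGP, a universal physical adversarial patch framework targeting infrared vision-language models | multimodal |
| 29 | EffiMiniVLM: A Compact Dual-Encoder Regression Framework | Proposes EffiMiniVLM, a compact dual-encoder regression framework for product quality prediction in cold-start scenarios | multimodal |
| 30 | Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes | Proposes GSAM, which efficiently fine-tunes SAM via random cropping to handle variable input image sizes | foundation model |
| 31 | Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding | Proposes a latent-space anonymization adapter module to protect privacy in video understanding models | foundation model |
| 32 | SAGA: Source Attribution of Generative AI Videos | SAGA: the first source attribution framework for generative AI videos, providing multi-granularity model attribution and interpretability analysis | foundation model |
| 33 | Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation | Proposes an efficient test-time depth completion method based on low-rank decoder adaptation | foundation model |
| 34 | When Negation Is a Geometry Problem in Vision-Language Models | Proposes a representation-engineering-based test-time intervention that improves CLIP models' understanding of textual negation | multimodal |
| 35 | Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks | Proposes random-label bridge training for effective transfer of language models to vision tasks | large language model |
| 36 | Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance | Reveals the fragility of vision-language models under geometric transformations, challenging their visual invariance | multimodal |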

🔬 Pillar 3: Perception & Semantics (19 papers)

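| # | Title | One-line summary | Tags |
|---|---|---|---|
| 37 | Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles | ArticuSurDepth: self-supervised surround depth estimation for articulated vehicles with improved cross-vehicle geometric consistency | depth estimation, metric depth, geometric consistency |
| 38 | GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes | GP-4DGS: probabilistic 4D Gaussian splatting from monocular video based on variational Gaussian processes | gaussian splatting, splatting, motion estimation |
| 39 | SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction | SparseSplat: the first feed-forward 3D Gaussian splatting method with adaptive Gaussian density, suited to downstream reconstruction tasks | 3D gaussian splatting, 3DGS, gaussian splatting |
| 40 | Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting | Proposes the MM-GS framework, which uses 3D Gaussian splatting to render dynamic scenes with multi-human, multi-object interaction | 3D gaussian splatting, gaussian splatting, splatting |
| 41 | VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping | Proposes VBGS-SLAM to address pose optimization and map evolution in SLAM | 3D gaussian splatting, 3DGS, gaussian splatting |
| 42 | DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection | Proposes DeCo-DETR, which achieves efficient open-vocabulary object detection through decoupled cognition | open-vocabulary, open vocabulary, multimodal |
| 43 | MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging | Proposes the MedGS framework to address illumination artifacts in endoscopic image reconstruction | 3D gaussian splatting, gaussian splatting, splatting |
| 44 | Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars | Uses a better body model, rather than a larger network, to improve Gaussian avatar reconstruction | 3D gaussian splatting, gaussian splatting, splatting |
| 45 | FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting | FACT-GS: frequency-aligned, complexity-aware texture reparameterization for 2D Gaussian splatting | gaussian splatting, splatting |
| 46 | Uncertainty-Aware 4D Gaussian Splatting for Monocular Occluded Human Rendering | Proposes uncertainty-aware 4D Gaussian splatting for monocular rendering of occluded humans | gaussian splatting, splatting |
| 47 | Environment-Aware Channel Prediction for Vehicular Communications: A Multimodal Visual Feature Fusion Framework | Proposes an environment-aware channel prediction framework based on multimodal visual feature fusion for vehicular communications | depth estimation, multimodal |
| 48 | TrackerSplat: Exploiting Point Tracking for Fast and Robust Dynamic 3D Gaussians Reconstruction | TrackerSplat: uses point tracking to speed up dynamic 3D Gaussian reconstruction and make it more robust | 3D gaussian splatting, 3DGS, gaussian splatting |
| 49 | NavCrafter: Exploring 3D Scenes from a Single Image | NavCrafter: a single-image-driven 3D scene exploration framework with controllable view synthesis and high-fidelity reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting |
| 50 | Factorized Multi-Resolution HashGrid for Efficient Neural Radiance Fields: Execution on Edge-Devices | Proposes Fact-Hash, a parameter encoding for efficient neural radiance fields on edge devices | NeRF, neural radiance field |
| 51 | Satellite-Free Training for Drone-View Geo-Localization | Proposes a drone-view geo-localization framework that requires no satellite imagery for training | 3D gaussian splatting, gaussian splatting, splatting |
| 52 | From Elevation Maps To Contour Lines: SVM and Decision Trees to Detect Violin Width Reduction | Uses SVMs and decision trees to automatically detect changes in violin width from elevation maps and contour lines | elevation map |
| 53 | Scene Grounding In the Wild | Proposes a semantic-alignment-based scene grounding framework to address 3D reconstruction of in-the-wild scenes | 3D gaussian splatting, gaussian splatting, splatting |
| 54 | Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation | Proposes a differentiable stroke planning method based on dual parameterization that generates paintings efficiently and with high fidelity | gaussian splatting, splatting |
| 55 | DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass | DePT3R: joint dense point tracking and 3D reconstruction of dynamic scenes in a single forward pass | scene understanding |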

🔬 Pillar 2: RL & Architecture (16 papers)

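| # | Title | One-line summary | Tags |
|---|---|---|---|
| 56 | ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving | ExploreVLA: dense world modeling and exploration for end-to-end autonomous driving | reinforcement learning, imitation learning, behavior cloning |
| 57 | RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection | RayMamba: enhances long-range 3D object detection through ray-aligned serialization | Mamba, SSM, state space model |
| 58 | FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment | FLEX: a large-scale multimodal, multi-view dataset for fitness action quality assessment | representation learning, multimodal |
| 59 | Edge-Efficient Two-Stream Multimodal Architecture for Non-Intrusive Bathroom Fall Detection | Proposes an edge-efficient two-stream multimodal architecture for non-intrusive bathroom fall detection | Mamba, multimodal |
| 60 | TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs | TARS: a MinMax token-adaptive preference strategy for reducing hallucination in multimodal large language models | DPO, direct preference optimization, large language model |
| 61 | 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars | 3DXTalker: expressive 3D talking avatar generation that unifies identity, lip sync, emotion, and spatial dynamics | flow matching, motion generation |
| 62 | Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding | Proposes UniScene3D, which achieves unified 3D scene understanding through contrastive language-colored pointmap pretraining | representation learning, scene understanding |
| 63 | DM3D: Deformable Mamba via Offset-Guided Differentiable Scanning for Point Cloud Understanding | Proposes DM3D, which achieves point cloud understanding through deformable Mamba and differentiable scanning | Mamba, SSM, state space model |
| 64 | DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization | Proposes DiFlowDubber, which uses discrete flow matching for automated video dubbing with cross-modal alignment and synchronization | flow matching, multimodal |
| 65 | FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder | Proposes FusionBERT, which performs multi-view image-3D retrieval via cross-attention visual fusion and a normal-aware 3D encoder | representation learning, multimodal |
| 66 | EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment | Proposes EvaNet, an efficient assessment framework for infrared and visible image fusion that is more consistent with human visual perception | contrastive learning, large language model |
| 67 | FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO | FaVChat: facial video understanding guided by hierarchical prompt queries, with data-efficient GRPO | reinforcement learning, large language model |
| 68 | SmartCLIP: Modular Vision-language Alignment with Identification Guarantees | Proposes SmartCLIP to address vision-language alignment | contrastive learning, multimodal |
| 69 | Training Multi-Image Vision Agents via End2End Reinforcement Learning | Proposes IMAgent, which trains multi-image vision agents via end-to-end reinforcement learning for multi-image QA tasks | reinforcement learning |
| 70 | Video Understanding: Through A Temporal Lens | Improves video understanding through a temporal lens, addressing the shortcomings of existing methods in modeling temporal relations | contrastive learning, egocentric |
| 71 | VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation | VERTIGO: a visual preference optimization framework for cinematic camera trajectory generation | DPO, direct preference optimization |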

🔬 Pillar 1: Robot Control (4 papers)

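| # | Title | One-line summary | Tags |
|---|---|---|---|
| 72 | UNICA: A Unified Neural Framework for Controllable 3D Avatars | UNICA: a unified neural framework for controllable 3D avatars that simplifies the character creation pipeline | motion planning, 3D gaussian splatting, gaussian splatting |
| 73 | A Unified Perspective on Adversarial Membership Manipulation in Vision Models | Proposes a unified perspective for analyzing adversarial membership manipulation in vision models, together with a defense method | manipulation |
| 74 | DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning | DocShield: an evidence-grounded reasoning framework for AI document safety that addresses text image forgery | manipulation, chain-of-thought |
| 75 | ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction | ReWeaver: a topology-accurate garment reconstruction framework suited to physical simulation | manipulation, sim-to-real |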

🔬 Pillar 4: Generative Motion (2 papers)

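| # | Title | One-line summary | Tags |
|---|---|---|---|
| 76 | Exploring Motion-Language Alignment for Text-driven Motion Generation | Proposes the MLA-Gen framework, which improves the quality of text-driven human motion generation through motion-language alignment | text-to-motion, text-driven motion, motion generation |
| 77 | THOM: Generating Physically Plausible Hand-Object Meshes From Text | Proposes the THOM framework, which generates physically plausible 3D hand-object interaction meshes from text | physically plausible, contact-aware, HOI |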

🔬 Pillar 8: Physics-based Animation (2 papers)

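| # | Title | One-line summary | Tags |
|---|---|---|---|
| 78 | STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models | Proposes STEAR, a layer-aware spatiotemporal evidence intervention that mitigates hallucination in video large language models | spatiotemporal, large language model, visual grounding |
| 79 | MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion | MMTalker: 3D talking head synthesis based on multiresolution and multimodal fusion | spatiotemporal, multimodal |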

🔬 Pillar 7: Motion Retargeting (2 papers)

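| # | Title | One-line summary | Tags |
|---|---|---|---|
| 80 | SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors | SING3R-SLAM: submap-based monocular Gaussian SLAM that uses 3D reconstruction priors for globally consistent indoor scene reconstruction | geometric consistency |
| 81 | SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions | SDesc3D: a layout-aware framework for generating 3D indoor scenes from short text descriptions | spatial relationship |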

🔬 Pillar 6: Video Extraction (1 paper)

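| # | Title | One-line summary | Tags |
|---|---|---|---|
| 82 | Motion Capture from Inertial and Vision Sensors | Proposes the MINIONS dataset and the SparseNet framework for low-cost human motion capture from inertial and vision sensors | SMPL, human motion |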
