cs.CV (2025-06-05)

📊 69 papers in total | 🔗 20 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (29 🔗8)
Pillar 3: Perception & Semantics (17 🔗6)
Pillar 2: RL & Architecture (13 🔗3)
Pillar 1: Robot Control (5 🔗2)
Pillar 6: Video Extraction (2 🔗1)
Pillar 8: Physics-based Animation (1)
Pillar 7: Motion Retargeting (1)
Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (29 papers)

# | Title | One-sentence summary | Tags | 🔗
1 | MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning | Proposes MINT-CoT to address the integration of visual signals in multimodal mathematical reasoning | large language model, multimodal, chain-of-thought
2 | From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | Proposes Anywhere3D-Bench to address multi-level visual grounding in 3D scenes | large language model, multimodal, visual grounding
3 | Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics | Proposes a multimodal street assessment framework to address the lack of subjective perception in urban design | large language model, multimodal
4 | Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations | Proposes the STARE benchmark to evaluate the spatial cognition of multimodal models on visual simulations | large language model, multimodal
5 | When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding | Proposes ZoomText and Grounded Layer Correction to mitigate semantic hallucination in scene text understanding | multimodal
6 | MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Proposes MORSE-500 to address the shortcomings of multimodal reasoning benchmarks | multimodal
7 | VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos | Proposes VideoMathQA to address mathematical reasoning in videos | multimodal
8 | Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards? | Uses foundation models to improve presentation attack detection on ID cards | foundation model
9 | MokA: Multimodal Low-Rank Adaptation for MLLMs | Proposes MokA to address the adaptation of multimodal large language models | multimodal
10 | Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis | Proposes TAPFM to address the adaptation of pathology foundation models for whole slide image analysis | foundation model
11 | PixCell: A generative foundation model for digital histopathology images | Proposes PixCell to address digital pathology image generation | foundation model
12 | Deep histological synthesis from mass spectrometry imaging for multimodal registration | Proposes pix2pix-based histology image synthesis to address multimodal registration | multimodal
13 | BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models | Proposes BYO-Eval to address the evaluation of multimodal language models | multimodal
14 | OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model | Proposes OpenMaskDINO3D to address reasoning 3D segmentation | large language model
15 | SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs | Proposes SparseMM to improve the efficiency of visual understanding in multimodal large language models | large language model, multimodal
16 | Towards Vision-Language-Garment Models for Web Knowledge Garment Understanding and Generation | Proposes the VLG model to address knowledge transfer in garment generation | foundation model, multimodal
17 | Quantifying Cross-Modality Memorization in Vision-Language Models | Quantifies cross-modality memorization in vision-language models to improve knowledge transfer | large language model, multimodal
18 | A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions | Surveys Vietnamese document analysis and recognition techniques and their unique challenges | large language model, multimodal
19 | TextVidBench: A Benchmark for Long Video Scene Text Understanding | Proposes TextVidBench to address long-video scene text understanding | large language model, multimodal
20 | APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval | Proposes APVR to address information retrieval in long video understanding | large language model, multimodal
21 | Refer to Any Segmentation Mask Group With Vision-Language Prompts | Proposes omnimodal referring expression segmentation to address limited vision-language interaction | multimodal
22 | Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos | Proposes the Perceive Anything model to address region-level understanding of images and videos | large language model
23 | MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm | Proposes MonkeyOCR to address the efficiency and accuracy of document parsing | multimodal
24 | SeedEdit 3.0: Fast and High-Quality Generative Image Editing | Proposes SeedEdit 3.0 to address high-quality image editing | instruction following
25 | FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing | Proposes FlowDirector to address the inversion process in video editing | instruction following
26 | LLMs Can Compensate for Deficiencies in Visual Representations | Proposes vision-language models to compensate for deficiencies in visual representations | multimodal
27 | Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model | Proposes a multi-dimensional evaluation model to address visual quality assessment of AI-generated videos | large language model
28 | Line of Sight: On Linear Representations in VLLMs | Proposes a multimodal sparse autoencoder to enhance image representations in VLLMs | multimodal
29 | HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model | Proposes HoliSafe to address insufficient safety in vision-language models | multimodal

🔬 Pillar 3: Perception & Semantics (17 papers)

# | Title | One-sentence summary | Tags | 🔗
30 | Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer | Proposes a stereo dataset generation method based on 3D Gaussian splatting to improve model generalization | 3D gaussian splatting, 3DGS, gaussian splatting
31 | Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting | Proposes PM-Loss to address point cloud sparsity caused by depth maps | 3D gaussian splatting, 3DGS, gaussian splatting
32 | Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting | Proposes a 3D Gaussian splatting-based point cloud segmentation method to address semantic segmentation of agricultural vehicles | 3D gaussian splatting, 3DGS, gaussian splatting
33 | UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting | Proposes UAV4D to address dynamic rendering of UAV imagery | gaussian splatting, splatting, SMPL
34 | VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction | Proposes VoxelSplat to address occupancy and flow prediction in dynamic environments | 3D gaussian splatting, gaussian splatting, splatting
35 | Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting | Proposes multi-scale bilateral grids to improve reconstruction accuracy for dynamic driving scenes | gaussian splatting, splatting, NeRF
36 | Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning | Proposes DirectLayout to address layout generation in 3D indoor scene synthesis | open-vocabulary, open vocabulary, embodied AI
37 | OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View | Proposes OGGSplat to address 3D scene reconstruction from sparse views | scene reconstruction, open-vocabulary, open vocabulary
38 | On-the-fly Reconstruction for Large-Scale Novel View Synthesis from Unposed Images | Proposes an on-the-fly reconstruction method to address large-scale novel view synthesis | 3D gaussian splatting, 3DGS, gaussian splatting
39 | Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos | Proposes layered motion fusion to address motion segmentation in dynamic videos | neural radiance field, egocentric
40 | Structure-Aware Radar-Camera Depth Estimation | Proposes structure-aware radar-camera depth estimation to address sparsity and noise | depth estimation, metric depth
41 | SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning | Proposes SIV-Bench to address social interaction understanding and reasoning | scene understanding, large language model, multimodal
42 | Gen-n-Val: Agentic Image Data Generation and Validation | Proposes the Gen-n-Val framework to address data scarcity and label noise in computer vision | open-vocabulary, open vocabulary, large language model
43 | FreeTimeGS: Free Gaussian Primitives at Anytime and Anywhere for Dynamic Scene Reconstruction | Proposes FreeTimeGS to address complex motion in dynamic scene reconstruction | scene reconstruction
44 | Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations | Proposes Object-X to address the reconstruction of multi-modal 3D object representations | 3D gaussian splatting, gaussian splatting, splatting
45 | Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth | Proposes the Perfecting Depth framework to improve the reliability of sensor depth data | metric depth
46 | ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation | Proposes ProJo4D to address sparse-view inverse physics estimation | NeRF, scene understanding

🔬 Pillar 2: RL & Architecture (13 papers)

# | Title | One-sentence summary | Tags | 🔗
47 | Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation | Proposes a new SSIM loss function to improve unsupervised monocular depth estimation | MAE, depth estimation, monocular depth
48 | Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains | Proposes MoViX to address off-road video localization without GPS | contrastive learning, spatiotemporal
49 | Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion | Proposes iDPOE to address trajectory prediction for endoscopic submucosal dissection | policy learning, imitation learning, diffusion policy
50 | DM-SegNet: Dual-Mamba Architecture for 3D Medical Image Segmentation with Global Context Modeling | Proposes DM-SegNet to address global context modeling in 3D medical image segmentation | Mamba, SSM, state space model
51 | Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions | Proposes Dream to Generalize to address zero-shot model-based reinforcement learning under visual distractions | reinforcement learning, world model, contrastive learning
52 | AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Proposes the CG-AV-Counting benchmark and the AV-Reasoner model to improve multimodal counting | reinforcement learning, curriculum learning, multimodal
53 | Video World Models with Long-term Spatial Memory | Proposes geometry-grounded long-term spatial memory to address consistency in video world models | world model
54 | From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos | Proposes TF-CoVR to address fine-grained video retrieval | contrastive learning, multimodal
55 | LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs | Proposes LeanPO to address preference alignment in video large language models | DPO, large language model
56 | Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study | Proposes a chess planning method based on supervised contrastive learning and strategic interpolation | contrastive learning
57 | Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning | Proposes the RAP method to efficiently select high-value data for multimodal reasoning | reinforcement learning, large language model
58 | Robustness Evaluation for Video Models with Reinforcement Learning | Proposes a multi-agent reinforcement learning method to evaluate the robustness of video models | reinforcement learning
59 | Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning | Proposes perception-reasoning decoupling to address the scalability of multimodal reasoning | reinforcement learning, large language model

🔬 Pillar 1: Robot Control (5 papers)

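# | Title | One-sentence summary | Tags | 🔗
60 | Towards Reliable Identification of Diffusion-based Image Manipulations | Proposes RADAR to address the identification of diffusion-based image manipulations | manipulation, foundation model
61 | DSG-World: Learning a 3D Gaussian World Model from Dual State Videos | Proposes DSG-World to address consistency in 3D world modeling | manipulation, world model
62 | Practical Manipulation Model for Robust Deepfake Detection | Proposes a practical manipulation model to improve the robustness of deepfake detection | manipulation
63 | Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations | Proposes architecture-designed IQA models to improve robustness | manipulation
64 | SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents | Proposes SmartAvatar to address precise control in 3D human avatar generation | manipulation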

🔬 Pillar 6: Video Extraction (2 papers)

# | Title | One-sentence summary | Tags | 🔗
65 | EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World? | Proposes EOC-Bench to address dynamic egocentric visual understanding | egocentric, egocentric vision, large language model
66 | VideoMolmo: Spatio-Temporal Grounding Meets Pointing | Proposes VideoMolmo to address spatio-temporal grounding in videos | egocentric, egocentric vision, large language model

🔬 Pillar 8: Physics-based Animation (1 paper)

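# | Title | One-sentence summary | Tags | 🔗
67 | Unleashing Hour-Scale Video Training for Long Video-Language Understanding | Proposes the VideoMarathon dataset to address the lack of training for long video-language understanding | spatiotemporal, multimodal, instruction following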

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags | 🔗
68 | EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh | Proposes EX-4D to address extreme-viewpoint video synthesis | geometric consistency

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-sentence summary | Tags | 🔗
69 | Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning | Proposes Follow-Your-Motion to address inconsistency in video motion transfer | motion generation
