cs.CV (2025-10-13)

📊 51 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (22 🔗7) | Pillar 3: Perception & Semantics (10 🔗2) | Pillar 2: RL & Architecture (8 🔗1) | Pillar 6: Video Extraction (4 🔗1) | Pillar 1: Robot Control (3) | Pillar 4: Generative Motion (3) | Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (22 papers)

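# | Title | One-line summary | Tags | 🔗
1 | AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model | AndesVL: an efficient multimodal large language model for mobile devices that balances performance and efficiency | large language model, multimodal
2 | InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models | InternSVG: unified handling of SVG tasks with multimodal large language models | large language model, multimodal
3 | FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models | FlexAC: flexible control of associative reasoning in multimodal large language models | large language model, multimodal
4 | A Survey on Agentic Multimodal Large Language Models | Surveys agentic multimodal large language models, exploring the application and development of autonomous agents in dynamic environments | large language model, multimodal
5 | BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models | BLEnD-Vis: builds a multimodal cultural-understanding benchmark to evaluate the robustness of cultural knowledge in vision-language models | multimodal, visual grounding
6 | CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images | Proposes CodePlot-CoT, which tackles mathematical visual reasoning through chain-of-thought with code-driven images | large language model, multimodal, chain-of-thought
7 | ExpVid: A Benchmark for Experiment Video Understanding & Reasoning | ExpVid: a benchmark for experiment video understanding and reasoning that challenges multimodal large language models on scientific experiments | large language model, multimodal, visual grounding
8 | MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis | Proposes MS-Mix to address data scarcity in multimodal sentiment analysis | multimodal
9 | Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping | Evaluates foundation models on hyperspectral image classification, applied to cereal crop type mapping | foundation model
10 | How many samples to label for an application given a foundation model? Chest X-ray classification study | Studies how pretrained models can reduce the number of labeled samples needed for chest X-ray classification | foundation model
11 | A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images | Proposes a framework based on large language models for automatic scale bar detection and extraction in scanning electron microscope images | large language model
12 | CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation | CoPRS: proposes learning a positional prior from chain-of-thought to improve the performance and interpretability of reasoning segmentation | chain-of-thought
13 | Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning | Proposes the SynTrans framework, which uses synergistic knowledge transfer from large multimodal models to improve few-shot learning | multimodal
14 | Mixup Helps Understanding Multimodal Video Better | Proposes a multimodal Mixup method that improves the generalization and robustness of multimodal video understanding models | multimodal
15 | IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment | IVEBench: a modern benchmark suite for assessing instruction-guided video editing | large language model, multimodal
16 | ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? | Proposes ODI-Bench to evaluate MLLMs on omnidirectional image understanding, along with the Omni-CoT method | large language model, chain-of-thought
17 | GIR-Bench: Versatile Benchmark for Generating Images with Reasoning | Proposes GIR-Bench to address inadequate evaluation of multimodal models | large language model, multimodal
18 | COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models | Proposes COCO-Tree, which uses neurosymbolic concept trees to strengthen compositional reasoning in vision-language models | large language model, chain-of-thought
19 | EvoCAD: Evolutionary CAD Code Generation with Vision Language Models | EvoCAD: generates CAD code with vision-language models and evolutionary algorithms | large language model
20 | Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts | Proposes a two-stage framework combining CLIP-SAM collaboration with cascaded prompts to improve zero-shot anomaly detection | foundation model
21 | IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation | Proposes the IUT-Plug plug-in, which improves context consistency in multimodal interleaved image-text generation through explicit structured reasoning | multimodal
22 | FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model | Proposes FG-CLIP 2 to improve fine-grained vision-language alignment in English-Chinese bilingual settings | multimodal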

🔬 Pillar 3: Perception & Semantics (10 papers)

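# | Title | One-line summary | Tags | 🔗
23 | PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image | PhySIC: reconstructs physically plausible 3D human-scene interaction and contact from a single image | monocular depth, scene understanding, physically plausible
24 | VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment | VA-GS: enhances the geometric representation of Gaussian splatting through view alignment, improving surface reconstruction accuracy | 3D gaussian splatting, gaussian splatting, splatting
25 | MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference | Proposes MaterialRefGS, which achieves high-quality reflective Gaussian splatting rendering through multi-view consistent material inference | gaussian splatting, splatting
26 | Ev4DGS: Novel-view Rendering of Non-Rigid Objects from Monocular Event Streams | Proposes Ev4DGS for novel-view rendering of non-rigid objects from monocular event streams | 3D gaussian splatting, gaussian splatting, splatting
27 | Evaluating the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation | Studies the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation | depth estimation
28 | DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects | DKPMV: dense keypoint fusion from multi-view RGB images for 6D pose estimation of textureless objects | 6D pose estimation
29 | A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation | Proposes a diffusion-model-based framework for low-cost training data generation that improves urban semantic segmentation | scene understanding, semantic map
30 | SNAP: Towards Segmenting Anything in Any Point Cloud | Proposes SNAP, a general interactive point cloud segmentation model that supports cross-domain use and multiple prompt types | open-vocabulary, open vocabulary
31 | mmWalk: Towards Multi-modal Multi-view Walking Assistance | mmWalk: a multimodal, multi-view walking assistance dataset and method for blind or low-vision people | scene understanding
32 | REACT3D: Recovering Articulations for Interactive Physical 3D Scenes | REACT3D: a framework for recovering articulations in interactive physical 3D scenes | scene understanding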

🔬 Pillar 2: RL & Architecture (8 papers)

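# | Title | One-line summary | Tags | 🔗
33 | G2L: From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Knowledge Distillation | Proposes the G2L framework, which uses knowledge distillation to transfer the capabilities of giga-scale pathology models to cancer-specific large-scale models | distillation, foundation model
34 | Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning | Vlaser: proposes a vision-language-action model with synergistic embodied reasoning, bridging the gap between VLM reasoning and VLA policy learning | policy learning, vision-language-action, VLA
35 | High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation | Proposes a high-resolution spatiotemporal modeling method based on global-local state space models for video-based human pose estimation | Mamba, state space model, spatiotemporal
36 | Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos | Proposes a class-prototype-based contrastive learning method for multi-label, fine-grained educational video classification | contrastive learning, multimodal
37 | Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning | Proposes the Chart-RVR framework, which uses reinforcement learning with verifiable rewards to improve the explainability and robustness of chart reasoning | reinforcement learning, chain-of-thought
38 | Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment | Proposes the RALI algorithm, which aligns image and text representations through contrastive learning for efficient image quality assessment | reinforcement learning, contrastive learning
39 | Topological Alignment of Shared Vision-Language Embedding Space | Proposes ToMCLIP, which strengthens the shared embedding space of multilingual vision-language models through topological alignment | representation learning, multimodal
40 | Source-Free Object Detection with Detection Transformer | Proposes the FRANCK framework, which achieves source-free object detection with DETR through query-centric feature enhancement | contrastive learning, distillation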

🔬 Pillar 6: Video Extraction (4 papers)

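# | Title | One-line summary | Tags | 🔗
41 | Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model | Proposes the Situat3DChange dataset for multimodal large language models to understand situated 3D scene changes | egocentric, large language model, multimodal
42 | FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding | FastHMR: accelerates human mesh recovery through token and layer merging with diffusion decoding | human mesh recovery, HMR
43 | ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training | ACE-G: improves the generalization of scene coordinate regression through query pre-training | feature matching
44 | Robust Ego-Exo Correspondence with Long-Term Memory | Proposes the long-term-memory-based LM-EEC framework to address feature fusion and memory capacity problems in ego-exo correspondence | egocentric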

🔬 Pillar 1: Robot Control (3 papers)

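# | Title | One-line summary | Tags | 🔗
45 | Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View | Proposes a category-agnostic framework for estimating object pose, size, and shape from a single view | manipulation, embodied AI, foundation model
46 | CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization | Proposes CoDefend, which defends multimodal large language models against adversarial attacks by combining diffusion purification and prompt optimization | manipulation, large language model, multimodal
47 | Zero-shot Face Editing via ID-Attribute Decoupled Inversion | Proposes a zero-shot face editing method based on ID-attribute decoupled inversion, addressing ID preservation and structural consistency | manipulation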

🔬 Pillar 4: Generative Motion (3 papers)

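# | Title | One-line summary | Tags | 🔗
48 | MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps | Proposes a semantics-aware scene motion generation method based on motion maps (MoMaps) that predicts future 3D scene motion from a single image | motion generation
49 | Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers | Proposes Detail Guidance, which improves image detail generation quality by modulating massive activations in Diffusion Transformers | classifier-free guidance
50 | LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference | Proposes LikePhys, which evaluates intuitive physics understanding in video diffusion models via likelihood preference | physically plausible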

🔬 Pillar 8: Physics-based Animation (1 paper)

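# | Title | One-line summary | Tags | 🔗
51 | Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment | DiPro: a multimodal disease progression modeling framework with spatiotemporal disentanglement and multiscale alignment | spatiotemporal, multimodal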
