cs.CV (2025-10-13)

📊 51 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (22 🔗7) | Pillar 3: Spatial Perception & Semantics (10 🔗2) | Pillar 2: RL Algorithms & Architecture (8 🔗1) | Pillar 6: Video Extraction & Matching (4 🔗1) | Pillar 1: Robot Control (3) | Pillar 4: Generative Motion (3) | Pillar 8: Physics-based Animation (1)

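🔬 Pillar 9: Embodied Foundation Models (22 papers)

#	Title	One-line summary	Tags	🔗
1	AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model	AndesVL: an efficient multimodal large language model for mobile devices that balances performance and efficiency.	large language model, multimodal	✅
2	InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models	InternSVG: unified handling of SVG tasks with multimodal large language models.	large language model, multimodal
3	FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models	FlexAC: flexible control of associative reasoning in multimodal large language models.	large language model, multimodal	✅
4	A Survey on Agentic Multimodal Large Language Models	Surveys agentic multimodal large language models, covering the application and development of autonomous agents in dynamic environments.	large language model, multimodal	✅
5	BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models	BLEnD-Vis: a multimodal cultural understanding benchmark that evaluates the robustness of cultural knowledge in vision-language models.	multimodal, visual grounding
6	CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images	Proposes CodePlot-CoT, which tackles mathematical visual reasoning with a chain of thought over code-driven images.	large language model, multimodal, chain-of-thought	✅
7	ExpVid: A Benchmark for Experiment Video Understanding & Reasoning	ExpVid: a benchmark for experiment video understanding and reasoning that challenges multimodal large language models on scientific experiments.	large language model, multimodal, visual grounding
8	MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis	Proposes MS-Mix to address data scarcity in multimodal sentiment analysis.	multimodal	✅
9	Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping	Benchmarks foundation models for hyperspectral image classification, applied to cereal crop type mapping.	foundation model
10	How many samples to label for an application given a foundation model? Chest X-ray classification study	Studies how a pretrained foundation model can reduce the number of labeled samples needed for chest X-ray classification.	foundation model
11	A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images	Proposes a large-language-model-assisted framework for automated scale bar detection and extraction in scanning electron microscope images.	large language model
12	CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation	CoPRS: learns a positional prior from chain-of-thought to improve the performance and interpretability of reasoning segmentation.	chain-of-thought	✅
13	Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning	Proposes the SynTrans framework, which uses synergistic knowledge transfer from large multimodal models to improve few-shot learning.	multimodal
14	Mixup Helps Understanding Multimodal Video Better	Proposes a multimodal Mixup method that improves the generalization and robustness of multimodal video understanding models.	multimodal
15	IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment	IVEBench: a modern benchmark suite for assessing instruction-guided video editing.	large language model, multimodal
16	ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?	Proposes ODI-Bench to evaluate MLLMs on omnidirectional image understanding, along with the Omni-CoT method.	large language model, chain-of-thought
17	GIR-Bench: Versatile Benchmark for Generating Images with Reasoning	Proposes GIR-Bench to address gaps in the evaluation of multimodal models.	large language model, multimodal	✅
18	COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models	Proposes COCO-Tree, which uses neurosymbolic concept trees to strengthen compositional reasoning in vision-language models.	large language model, chain-of-thought
19	EvoCAD: Evolutionary CAD Code Generation with Vision Language Models	EvoCAD: CAD code generation with vision-language models and evolutionary algorithms.	large language model
20	Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts	Proposes a two-stage framework with CLIP-SAM collaboration and cascaded prompts to improve zero-shot anomaly detection.	foundation model
21	IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation	Proposes the IUT-Plug plug-in, which uses explicit structured reasoning to improve context consistency in interleaved multimodal image-text generation.	multimodal
22	FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model	Proposes FG-CLIP 2 to improve fine-grained vision-language alignment in English-Chinese bilingual settings.	multimodal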

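🔬 Pillar 3: Spatial Perception & Semantics (10 papers)

#	Title	One-line summary	Tags	🔗
23	PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image	PhySIC: reconstructs physically plausible 3D human-scene interaction and contact from a single image.	monocular depth, scene understanding, physically plausible
24	VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment	VA-GS: enhances the geometric representation of Gaussian splatting via view alignment, improving surface reconstruction accuracy.	3D gaussian splatting, gaussian splatting, splatting	✅
25	MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference	Proposes MaterialRefGS, which achieves high-quality reflective Gaussian splatting rendering through multi-view consistent material inference.	gaussian splatting, splatting
26	Ev4DGS: Novel-view Rendering of Non-Rigid Objects from Monocular Event Streams	Proposes Ev4DGS for novel-view rendering of non-rigid objects from monocular event streams.	3D gaussian splatting, gaussian splatting, splatting
27	Evaluating the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation	Studies the effects of preprocessing, method selection, and hyperparameter tuning on SAR-based flood mapping and water depth estimation.	depth estimation
28	DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects	DKPMV: dense keypoint fusion from multi-view RGB images for 6D pose estimation of textureless objects.	6D pose estimation
29	A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation	Proposes a diffusion-model-based framework for low-cost training data generation that improves urban semantic segmentation.	scene understanding, semantic map
30	SNAP: Towards Segmenting Anything in Any Point Cloud	Proposes SNAP, a general interactive point cloud segmentation model that supports cross-domain use and multiple prompt types.	open-vocabulary, open vocabulary	✅
31	mmWalk: Towards Multi-modal Multi-view Walking Assistance	mmWalk: a multimodal, multi-view walking assistance dataset and method for blind or low-vision people.	scene understanding
32	REACT3D: Recovering Articulations for Interactive Physical 3D Scenes	REACT3D: a framework for recovering articulated structures for interactive physical 3D scenes.	scene understanding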

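🔬 Pillar 2: RL Algorithms & Architecture (8 papers)

#	Title	One-line summary	Tags	🔗
33	G2L: From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Knowledge Distillation	Proposes the G2L framework, which uses knowledge distillation to transfer the capabilities of giga-scale pathology models to cancer-specific large-scale models.	distillation, foundation model
34	Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning	Vlaser: a vision-language-action model with synergistic embodied reasoning that bridges the gap between VLM reasoning and VLA policy learning.	policy learning, vision-language-action, VLA
35	High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation	Proposes a high-resolution spatiotemporal modeling method based on global-local state space models for video-based human pose estimation.	Mamba, state space model, spatiotemporal
36	Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos	Proposes a class-prototype-based contrastive learning method for classifying multi-label, fine-grained educational videos.	contrastive learning, multimodal	✅
37	Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning	Proposes the Chart-RVR framework, which uses reinforcement learning with verifiable rewards to improve the explainability and robustness of chart reasoning.	reinforcement learning, chain-of-thought
38	Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment	Proposes RALI, which aligns image and text representations via contrastive learning for efficient image quality assessment.	reinforcement learning, contrastive learning
39	Topological Alignment of Shared Vision-Language Embedding Space	Proposes ToMCLIP, which uses topological alignment to strengthen the shared embedding space of multilingual vision-language models.	representation learning, multimodal
40	Source-Free Object Detection with Detection Transformer	Proposes the FRANCK framework, which enables source-free object detection for DETR through query-centric feature enhancement.	contrastive learning, distillation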

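🔬 Pillar 6: Video Extraction & Matching (4 papers)

#	Title	One-line summary	Tags	🔗
41	Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model	Proposes the Situat3DChange dataset for situated 3D scene change understanding with multimodal large language models.	egocentric, large language model, multimodal
42	FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding	FastHMR: accelerates human mesh recovery via token and layer merging with diffusion decoding.	human mesh recovery, HMR
43	ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training	ACE-G: improves the generalization of scene coordinate regression through query pre-training.	feature matching
44	Robust Ego-Exo Correspondence with Long-Term Memory	Proposes the long-term-memory-based LM-EEC framework, which addresses feature fusion and memory capacity issues in ego-exo correspondence.	egocentric	✅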

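🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags
45	Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View	Proposes a category-agnostic framework for estimating object pose, size, and shape from a single view.	manipulation, embodied AI, foundation model
46	CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization	Proposes CoDefend, which combines diffusion purification and prompt optimization to defend multimodal large language models against adversarial attacks.	manipulation, large language model, multimodal
47	Zero-shot Face Editing via ID-Attribute Decoupled Inversion	Proposes a zero-shot face editing method based on ID-attribute decoupled inversion that addresses identity preservation and structural consistency.	manipulation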

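🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-line summary	Tags
48	MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps	Proposes a semantics-aware scene motion generation method based on motion maps (MoMaps) that predicts future 3D scene motion from a single image.	motion generation
49	Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers	Proposes Detail Guidance, which modulates massive activations in Diffusion Transformers to improve the quality of generated image detail.	classifier-free guidance
50	LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference	Proposes LikePhys, which evaluates intuitive physics understanding in video diffusion models via likelihood preference.	physically plausible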

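🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗
51	Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment	DiPro: a multimodal disease progression modeling framework with spatiotemporal disentanglement and multiscale alignment.	spatiotemporal, multimodal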
