cs.CV (2025-09-12)

📊 22 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗2) · Pillar 2: RL & Architecture (7 🔗3) · Pillar 1: Robot Control (3) · Pillar 6: Video Extraction (2) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Towards Understanding Visual Grounding in Visual Language Models	Surveys visual grounding techniques in vision-language models and analyzes challenges and future directions	multimodal visual grounding chain-of-thought
2	Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration	Proposes AVI-Math, a benchmark for mathematical reasoning over UAV imagery, revealing the limitations of existing VLMs	multimodal chain-of-thought	✅
3	A Comparison and Evaluation of Fine-tuned Convolutional Neural Networks to Large Language Models for Image Classification and Segmentation of Brain Tumors on MRI	Compares fine-tuned LLMs and CNNs on brain tumor MRI image classification and segmentation	large language model
4	MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection	MCL-AD: a multimodal collaboration learning framework for zero-shot 3D anomaly detection	multimodal
5	SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation	SCOPE: a speech-guided collaborative perception framework for surgical scene segmentation	large language model foundation model
6	LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA	Proposes the LaV-CoT framework, which uses multi-aspect reward optimization to tackle real-world multilingual VQA	multimodal chain-of-thought	✅
7	VARCO-VISION-2.0 Technical Report	VARCO-VISION-2.0: an open-source bilingual vision-language model with improved multimodal understanding and OCR	multimodal

🔬 Pillar 2: RL & Architecture (7 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
8	SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion	SignClip: sign language translation via multimodal contrastive fusion that exploits mouthing cues	contrastive learning large language model multimodal
9	Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses	Builds a general SimCLR self-supervised foundation model for brain MRI to improve 3D diagnosis of brain diseases	masked autoencoder MAE foundation model	✅
10	OnlineHOI: Towards Online Human-Object Interaction Generation and Perception	Proposes the OnlineHOI framework for online human-object interaction generation and perception	Mamba human-object interaction HOI
11	FLARE-SSM: Deep State Space Models with Influence-Balanced Loss for 72-Hour Solar Flare Prediction	Proposes FLARE-SSM, which uses deep state space models and an influence-balanced loss for 72-hour solar flare prediction	SSM state space model
12	SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and Datasets	SSL-AD: spatiotemporal self-supervised learning that improves generalizability and adaptability across Alzheimer's prediction tasks	contrastive learning spatiotemporal	✅
13	LayerLock: Non-collapsing Representation Learning with Progressive Freezing	LayerLock: non-collapsing self-supervised representation learning through progressive freezing	representation learning MAE
14	Efficient Learned Image Compression Through Knowledge Distillation	Proposes an efficient knowledge-distillation-based image compression method that reduces resource usage and improves practical applicability	distillation	✅

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
15	Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation	Proposes a training-free framework that uses an LLM to enhance text embeddings, improving color accuracy in diffusion-generated images	manipulation spatial relationship large language model
16	Detecting Text Manipulation in Images using Vision Language Models	Uses vision-language models to detect text manipulation in images	manipulation
17	GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection	GAMMA: generalizable alignment for AI-generated image detection via multi-task and manipulation-augmented training	manipulation

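🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
18	Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics	Proposes the PixelHumor benchmark dataset to evaluate how well large multimodal models understand humor in online comics	HuMoR multimodal
19	SCoDA: Self-supervised Continual Domain Adaptation	Proposes SCoDA, which achieves source-free continual domain adaptation through self-supervision and geometric manifold alignment	feature matching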

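🔬 Pillar 3: Perception & Semantics (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
20	Multimodal SAM-adapter for Semantic Segmentation	Proposes MM SAM-adapter to improve the robustness of multimodal semantic segmentation in challenging environments	scene understanding multimodal	✅
21	On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints	Compares the geometric accuracy of implicit and explicit novel view synthesis methods for space robotics applications	gaussian splatting splatting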

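🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
22	USCTNet: A deep unfolding nuclear-norm optimization solver for physically consistent HSI reconstruction	USCTNet: a deep unfolding nuclear-norm optimization solver for physically consistent hyperspectral image reconstruction	HSI	✅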
