cs.CV (2025-09-12)

📊 22 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗2) · Pillar 2: RL & Architecture (7 🔗3) · Pillar 1: Robot Control (3) · Pillar 6: Video Extraction (2) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1)

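🔬 Pillar 9: Embodied Foundation Models (7 papers)

# | Title | One-line summary | Tags
1 | Towards Understanding Visual Grounding in Visual Language Models | Surveys visual grounding techniques in vision-language models and analyzes challenges and future directions. | multimodal, visual grounding, chain-of-thought
2 | Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration | Proposes AVI-Math, a benchmark for mathematical reasoning over UAV imagery, revealing the limitations of existing VLMs. | multimodal, chain-of-thought
3 | A Comparison and Evaluation of Fine-tuned Convolutional Neural Networks to Large Language Models for Image Classification and Segmentation of Brain Tumors on MRI | Compares fine-tuned LLMs and CNNs on classification and segmentation of brain tumors in MRI. | large language model
4 | MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection | MCL-AD: proposes a multimodal collaboration learning framework for zero-shot 3D anomaly detection. | multimodal
5 | SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation | SCOPE framework: speech-guided collaborative perception for surgical scene segmentation. | large language model, foundation model
6 | LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA | Proposes the LaV-CoT framework, which uses multi-aspect reward optimization to address real-world multilingual VQA. | multimodal, chain-of-thought
7 | VARCO-VISION-2.0 Technical Report | VARCO-VISION-2.0: an open-source bilingual vision-language model with improved multimodal understanding and OCR. | multimodal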

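🔬 Pillar 2: RL & Architecture (7 papers)

# | Title | One-line summary | Tags
8 | SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion | SignClip: sign language translation via multimodal contrastive fusion that leverages mouthing cues. | contrastive learning, large language model, multimodal
9 | Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses | Builds a general SimCLR self-supervised foundation model for brain MRI to improve 3D brain disease diagnosis. | masked autoencoder, MAE, foundation model
10 | OnlineHOI: Towards Online Human-Object Interaction Generation and Perception | Proposes the OnlineHOI framework for online human-object interaction generation and perception. | Mamba, human-object interaction, HOI
11 | FLARE-SSM: Deep State Space Models with Influence-Balanced Loss for 72-Hour Solar Flare Prediction | Proposes FLARE-SSM, which uses deep state space models and an influence-balanced loss for 72-hour solar flare prediction. | SSM, state space model
12 | SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and Datasets | SSL-AD: spatiotemporal self-supervised learning that improves generalizability and adaptability across Alzheimer's prediction tasks. | contrastive learning, spatiotemporal
13 | LayerLock: Non-collapsing Representation Learning with Progressive Freezing | LayerLock: non-collapsing self-supervised representation learning through progressive freezing. | representation learning, MAE
14 | Efficient Learned Image Compression Through Knowledge Distillation | Proposes an efficient image compression method based on knowledge distillation that reduces resource usage and improves practicality. | distillation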

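🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line summary | Tags
15 | Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation | Proposes a training-free framework that uses an LLM to enhance text embeddings, improving the color accuracy of images generated by diffusion models. | manipulation, spatial relationship, large language model
16 | Detecting Text Manipulation in Images using Vision Language Models | Uses vision-language models to detect text manipulation in images. | manipulation
17 | GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection | GAMMA: generalizable alignment for AI-generated image detection via multi-task and manipulation-augmented training. | manipulation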

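🔬 Pillar 6: Video Extraction (2 papers)

# | Title | One-line summary | Tags
18 | Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics | Proposes the PixelHumor benchmark dataset to evaluate how well large multimodal models understand humor in online comics. | HuMoR, multimodal
19 | SCoDA: Self-supervised Continual Domain Adaptation | Proposes SCoDA, which achieves source-free continual domain adaptation through self-supervision and geometric manifold alignment. | feature matching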

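🔬 Pillar 3: Perception & Semantics (2 papers)

# | Title | One-line summary | Tags
20 | Multimodal SAM-adapter for Semantic Segmentation | Proposes MM SAM-adapter to improve the robustness of multimodal semantic segmentation in complex environments. | scene understanding, multimodal
21 | On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints | For space robotics applications, compares the geometric accuracy of implicit and explicit novel view synthesis methods. | gaussian splatting, splatting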

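🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags
22 | USCTNet: A deep unfolding nuclear-norm optimization solver for physically consistent HSI reconstruction | USCTNet: a deep unfolding nuclear-norm optimization solver for physically consistent hyperspectral image reconstruction. | HSI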
