cs.CV (2025-09-10)

📊 27 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗2) · Pillar 2: RL Algorithms & Architecture (8 🔗2) · Pillar 3: Spatial Perception & Semantics (4 🔗1) · Pillar 6: Video Extraction & Matching (2 🔗1) · Pillar 1: Robot Control (2 🔗1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

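# | Title | One-line summary | Tags | 🔗
1 | COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation | COCO-Urdu: builds a large-scale Urdu image-caption dataset and proposes a multimodal quality-estimation framework. | large language model, multimodal, visual grounding
2 | Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles | Proposes MMB, which calibrates MLLM judge bias in text-to-image generation evaluation via multimodal Bayesian prompt ensembles. | large language model, multimodal
3 | Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation | Proposes a vision-language semantic aggregation method based on EM aggregation and text-guided decoding to improve the generalization of medical image segmentation. | foundation model, multimodal
4 | MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance | Introduces MITS, a large-scale multimodal dataset for intelligent traffic surveillance, improving LMM performance in the ITS domain. | multimodal, instruction following
5 | An Open Benchmark Dataset for GeoAI Foundation Models for Oil Palm Mapping in Indonesia | Releases an open benchmark dataset for GeoAI foundation models for oil palm mapping in Indonesia, in support of sustainable development. | foundation model, PaLM-E
6 | Recurrence Meets Transformers for Universal Multimodal Retrieval | Proposes ReT-2, a universal multimodal retrieval model that supports multimodal queries. | multimodal
7 | Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis | Proposes a retrieval-augmented vision-language model to improve the accuracy of multimodal melanoma diagnosis. | multimodal
8 | A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval | Proposes a multimodal RAG framework for post-disaster housing damage assessment that jointly optimizes image encoding and policy vector retrieval. | multimodal
9 | BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion | Proposes BcQLM, a lightweight MLLM framework built on BreezeCLIP for efficient vision-language understanding. | large language model, multimodal
10 | AdsQA: Towards Advertisement Video Understanding | Introduces the AdsQA benchmark for advertisement video understanding and designs the ReAd-R model to improve LLM capability in the advertising domain. | large language model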

🔬 Pillar 2: RL Algorithms & Architecture (8 papers)

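# | Title | One-line summary | Tags | 🔗
11 | PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability | PromptGuard: an orchestrated prompting framework that improves the safety, fairness, and controllability of LLM-generated text for vulnerable populations. | contrastive learning, large language model, chain-of-thought
12 | First-order State Space Model for Lightweight Image Super-resolution | Proposes FSSM, a first-order state space model for lightweight image super-resolution that improves performance without additional parameters. | Mamba, SSM, state space model
13 | SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training | SimCroP: similarity-driven cross-granularity pre-training that improves representation learning for chest CT images. | representation learning, multimodal
14 | World Modeling with Probabilistic Structure Integration | Proposes Probabilistic Structure Integration (PSI) for learning controllable, flexibly promptable world models. | world model, optical flow
15 | RewardDance: Reward Scaling in Visual Generation | RewardDance: addresses reward scaling and reward hacking in visual generation through generative reward modeling. | reinforcement learning, RLHF, chain-of-thought
16 | Hyperspectral Mamba for Hyperspectral Object Tracking | Proposes HyMamba, a Mamba-based network for hyperspectral object tracking that improves tracking accuracy in complex scenes. | Mamba, SSM
17 | Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening | Proposes a time-aware video representation learning method based on latent straightening for recognizing chiral actions. | representation learning
18 | Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video | Proposes a video disentanglement framework based on bitrate-controlled diffusion to separate motion from content in video. | representation learning, motion generation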

🔬 Pillar 3: Spatial Perception & Semantics (4 papers)

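# | Title | One-line summary | Tags | 🔗
19 | Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation | Proposes a prompt-driven image analysis pipeline with multimodal generative AI covering detection, segmentation, inpainting, and description. | open-vocabulary, open vocabulary, multimodal
20 | FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization | Proposes FractalPINN-Flow, an unsupervised optical flow estimation method based on a fractal-inspired network. | optical flow
21 | UltrON: Ultrasound Occupancy Networks | UltrON: an occupancy network for ultrasound images that uses acoustic features to address 3D reconstruction under weak supervision. | implicit representation, geometric consistency
22 | Semantic Causality-Aware Vision-Based 3D Occupancy Prediction | Proposes a semantic causality-aware 3D occupancy prediction method that addresses error accumulation in the 2D-to-3D transformation. | semantic mapping, semantic map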

🔬 Pillar 6: Video Extraction & Matching (2 papers)

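# | Title | One-line summary | Tags | 🔗
23 | iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning | iMatcher: improves feature matching in point cloud registration via local-to-global geometric consistency learning. | feature matching, geometric consistency
24 | Diffusion-Based Action Recognition Generalizes to Untrained Domains | Proposes a diffusion-based action recognition method that improves generalization to untrained domains. | egocentric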

🔬 Pillar 1: Robot Control (2 papers)

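# | Title | One-line summary | Tags | 🔗
25 | EfficientIML: Efficient High-Resolution Image Manipulation Localization | Proposes EfficientIML, a model that efficiently localizes diffusion-based manipulations in high-resolution images. | manipulation
26 | ArgoTweak: Towards Self-Updating HD Maps through Structured Priors | ArgoTweak: enables self-updating HD maps through structured priors. | sim2real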

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
27 | HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning | HuMo: human-centric video generation via collaborative multimodal conditioning. | classifier-free guidance, foundation model, multimodal
