cs.CV (2025-05-05)

📊 25 papers in total | 🔗 5 with code

🎯 Navigation by Interest Area

Pillar 9: Embodied Foundation Models (13 🔗3) Pillar 2: RL & Architecture (5) Pillar 3: Perception & Semantics (4 🔗1) Pillar 1: Robot Control (2 🔗1) Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data	Proposes a multimodal deep learning method to improve stroke prediction and detection	foundation model multimodal
2	AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation	Proposes anatomical ontology-guided reasoning to improve chest X-ray interpretation	multimodal
3	GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation	Proposes GAME for personality trait estimation from short videos	multimodal
4	DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction	Proposes DeepSparse to address the high-radiation and computational challenges in sparse-view CBCT reconstruction	foundation model
5	Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models	Proposes VELM for industrial anomaly classification	large language model
6	Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities	Proposes unified multimodal understanding and generation models to address the independent evolution of the two fields	multimodal	✅
7	Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction	Proposes Ming-Lite-Uni, a unified architecture for multimodal interaction	multimodal
8	Timing Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging	Proposes a sequential forward search algorithm to optimize fusion timing in multimodal medical imaging	multimodal
9	Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection	Proposes an image-event fusion method to address insufficient temporal information in video anomaly detection	multimodal	✅
10	Using Knowledge Graphs to harvest datasets for efficient CLIP model training	Uses knowledge graphs to efficiently collect datasets for training CLIP models	foundation model
11	RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet	Proposes RGBX-DiffusionDet for multimodal object detection	multimodal
12	Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey	Proposes a new CLIP-based multimodal OOD detection framework to address the limitations of existing methods	multimodal
13	TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment	Proposes TeDA for 3D object retrieval of unseen categories	multimodal	✅

🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning	Proposes R1-Reward to address instability in multimodal reward modeling	reinforcement learning reward design large language model
15	VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection	Proposes VAEmo to address data scarcity and expression differences in multimodal emotion recognition	representation learning contrastive learning large language model
16	Text to Image Generation and Editing: A Survey	A comprehensive survey of text-to-image generation and editing techniques	Mamba classifier-free guidance foundation model
17	Learning 3D Persistent Embodied World Models	Proposes a persistent embodied world model for long-horizon planning	policy learning world model
18	No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves	Proposes a self-representation alignment method to improve the generation quality of diffusion transformers	representation learning distillation foundation model

🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models	Proposes DiffuGTS to address generalizability in tumor segmentation	open-vocabulary open vocabulary	✅
20	VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery	Proposes VGLD to address linguistic ambiguity in monocular depth estimation	depth estimation monocular depth metric depth
21	6D Pose Estimation on Spoons and Hands	Proposes a dietary monitoring system based on 6D pose estimation for analyzing eating behavior	6D pose estimation
22	DELTA: Dense Depth from Events and LiDAR using Transformer's Attention	Proposes DELTA, which fuses event camera and LiDAR data for depth estimation	depth estimation

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
23	MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans	Proposes MetaScenes for replicating real-world 3D scenes	manipulation sim-to-real embodied AI	✅
24	Sim2Real in endoscopy segmentation with a novel structure aware image translation	Proposes a novel image translation model for endoscopic image segmentation	sim2real

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
25	Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation	Proposes Scenethesis to address spatial realism in 3D scene generation	physically plausible penetration embodied AI
