cs.CV(2025-06-20)
📊 20 papers in total | 🔗 6 with code
🎯 Navigation by interest area
Pillar 9: Embodied Foundation Models (7 🔗1)
Pillar 3: Perception & Semantics (5 🔗3)
Pillar 2: RL & Architecture (3 🔗1)
Pillar 6: Video Extraction (3 🔗1)
Pillar 1: Robot Control (2)
🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You | Proposes STRUCTURE to address data scarcity in multimodal alignment | foundation model multimodal | | |
| 2 | When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network | Proposes a multimodal asynchronous hybrid network for real-time anomaly detection | multimodal | | |
| 3 | MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation | Proposes MEXA to address the aggregation of expert models in multimodal reasoning | multimodal | | |
| 4 | Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge | Proposes MM-LG to efficiently extract multimodal generalizable knowledge from CLIP | multimodal | | |
| 5 | LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation | Proposes LaVi to address the inefficiency of vision-language models | large language model multimodal | | |
| 6 | Do We Need Large VLMs for Spotting Soccer Actions? | Proposes a language-model-based method for soccer action spotting as an alternative to video processing | large language model | | |
| 7 | Multi-label Scene Classification for Autonomous Vehicles: Acquiring and Accumulating Knowledge from Diverse Datasets | Proposes KAA-CAL for multi-label scene classification in autonomous driving | foundation model | ✅ | |
🔬 Pillar 3: Perception & Semantics (5 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 8 | Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting | Proposes Part²GS for modeling articulated objects | 3D gaussian splatting gaussian splatting splatting | | |
| 9 | DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches | Proposes DepthVanish to optimize adversarial patches against stereo depth estimation | depth estimation stereo depth | ✅ | |
| 10 | RGBTrack: Fast, Robust Depth-Free 6D Pose Estimation and Tracking | Proposes RGBTrack for real-time 6D pose estimation and tracking | 6D pose estimation | ✅ | |
| 11 | AnyTraverse: An off-road traversability framework with VLM and human operator in the loop | Proposes the AnyTraverse framework for off-road traversability in complex environments | traversability | | |
| 12 | LunarLoc: Segment-Based Global Localization on the Moon | Proposes LunarLoc for global localization on the lunar surface | VIO | ✅ | |
🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search | Proposes MICS to address the limited reasoning ability of medical multimodal large language models | curriculum learning large language model multimodal | ✅ | |
| 14 | RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought | Proposes RealSR-R1 for real-world image super-resolution | reinforcement learning large language model chain-of-thought | | |
| 15 | UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation | Proposes UniFork to address task interference between multimodal understanding and generation | representation learning multimodal | | |
🔬 Pillar 6: Video Extraction (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | Proposes VLN-R1 to address path planning in vision-language navigation | egocentric embodied AI VLN | | |
| 17 | Learning golf swing signatures from a single wrist-worn inertial sensor | Proposes a golf swing analysis framework based on a single wrist-worn sensor to address shortcomings of existing methods | human mesh recovery | | |
| 18 | Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes | Proposes the Co-VisiON benchmark for co-visibility reasoning over sparse image sets | feature matching | ✅ | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens | Proposes a machine mental imagery framework to strengthen multimodal reasoning | manipulation reinforcement learning distillation | | |
| 20 | Self-supervised Feature Extraction for Enhanced Ball Detection on Soccer Robots | Proposes a self-supervised feature extraction method to improve ball detection on soccer robots | humanoid | | |