cs.CV (2025-06-04)
📊 17 papers in total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception & Semantics (6)
Pillar 9: Embodied Foundation Models (4 🔗1)
Pillar 6: Video Extraction & Matching (3)
Pillar 2: RL Algorithms & Architecture (2)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 1: Robot Control (1)
🔬 Pillar 3: Spatial Perception & Semantics (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting | Proposes FlexGS to address memory constraints in 3D Gaussian splatting rendering | 3D gaussian splatting, 3DGS, gaussian splatting | | |
| 2 | Photoreal Scene Reconstruction from an Egocentric Device | Proposes visual-inertial bundle adjustment to address reconstruction with rolling-shutter cameras | gaussian splatting, splatting, scene reconstruction | | |
| 3 | HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting | Proposes HuGeDiff to address control and detail in 3D human generation | gaussian splatting, splatting, neural radiance field | | |
| 4 | UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | Proposes the UniCUE framework for Chinese Cued Speech video-to-speech generation | semantic mapping, semantic map, multimodal | | |
| 5 | Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation | Proposes Voyager to address long-range, consistent 3D scene generation | metric depth | | |
| 6 | GlobalBuildingAtlas: An Open Global and Complete Dataset of Building Polygons, Heights and LoD1 3D Models | Proposes GlobalBuildingAtlas to address the lack of global building data | height map | | |
🔬 Pillar 9: Embodied Foundation Models (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos | Proposes MMR-V to address the challenges of multimodal video reasoning | large language model, multimodal, chain-of-thought | | |
| 8 | Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization | Proposes entity-centric multimodal preference optimization to mitigate hallucinations in large vision-language models | large language model, multimodal | | |
| 9 | Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning | Proposes Rex-Thinker to address interpretability and reliability in object referring | chain-of-thought | | |
| 10 | ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding | Proposes ReXVQA as a benchmark for chest X-ray visual question answering | large language model, multimodal | ✅ | |
🔬 Pillar 6: Video Extraction & Matching (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs | Proposes the Struct2D framework to address spatial reasoning in MLLMs | egocentric, large language model, multimodal | | |
| 12 | Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | Proposes the Oxford Day-and-Night dataset to address nighttime visual relocalization | egocentric | | |
| 13 | SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing | Proposes SAVVY to address dynamic 3D spatial reasoning | egocentric, large language model | | |
🔬 Pillar 2: RL Algorithms & Architecture (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Language-Image Alignment with Fixed Text Encoders | Proposes the LIFT method to simplify language-image alignment | representation learning, contrastive learning, large language model | | |
| 15 | Object-level Self-Distillation for Vision Pretraining | Proposes object-level self-distillation to address the limitations of image-level self-distillation | distillation | | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | Zero-Shot Temporal Interaction Localization for Egocentric Videos | Proposes EgoLoc to address temporal interaction localization in egocentric videos | human-object interaction, HOI, egocentric | ✅ | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning | Proposes the WorldPrediction benchmark to address high-level world modeling and long-horizon planning | motion planning, world model | | |