cs.CV (2025-10-25)
📊 17 papers in total | 🔗 3 with code
🎯 Topic Navigation
Pillar 2: RL Algorithms & Architecture (6 papers, 🔗 1 with code)
Pillar 3: Spatial Perception & Semantics (5 papers, 🔗 2 with code)
Pillar 6: Video Extraction & Matching (2 papers)
Pillar 9: Embodied Foundation Models (2 papers)
Pillar 4: Generative Motion (1 paper)
Pillar 7: Motion Retargeting (1 paper)
🔬 Pillar 2: RL Algorithms & Architecture (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer's Disease Diagnosis | Proposes a cross-enhanced multimodal fusion framework that combines eye-tracking and facial features for Alzheimer's disease diagnosis. | representation learning, multimodal | | |
| 2 | GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping | GRPO-Guard: mitigates implicit over-optimization in flow matching through regulated clipping. | reinforcement learning, PPO, flow matching | | |
| 3 | CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning | Proposes CityRiSE, which uses reinforcement learning to improve how vision-language models reason about urban socio-economic status. | reinforcement learning, reward design | | |
| 4 | LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction | LOC: a general language-guided framework for open-set 3D occupancy prediction. | contrastive learning, distillation, scene understanding | | |
| 5 | Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning | Proposes a self-supervised learning method based on graph neural networks that exploits inter-instance relations to improve representation quality. | representation learning | ✅ | |
| 6 | LongCat-Video Technical Report | LongCat-Video: an efficient long-video generation model built on a diffusion Transformer. | RLHF, world model | | |
🔬 Pillar 3: Spatial Perception & Semantics (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model | EndoSfM3D: learns 3D reconstruction of endoscopic surgery scenes using a self-supervised foundation model. | depth estimation, monocular depth, Depth Anything | ✅ | |
| 8 | I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions | I2-NeRF: proposes a physically grounded neural radiance field that improves 3D reconstruction under media-induced degradation. | NeRF, neural radiance field | | |
| 9 | CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding | CogStereo: neural stereo matching with an implicit spatial cognition embedding that improves zero-shot generalization. | monocular depth, scene understanding, scene flow | | |
| 10 | DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum | DynamicTree: interactive animation of real trees via a sparse voxel spectrum. | 3DGS, gaussian splatting, splatting | | |
| 11 | STG-Avatar: Animatable Human Avatars via Spacetime Gaussian | Proposes STG-Avatar, which reconstructs high-fidelity animatable human avatars through spacetime Gaussian optimization. | 3DGS, optical flow | ✅ | |
🔬 Pillar 6: Video Extraction & Matching (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents | WAGIBench: a benchmark for egocentric multimodal goal inference in assistive wearable agents. | egocentric, multimodal | | |
| 13 | egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks | egoEMOTION: a dataset combining first-person vision and physiological signals for emotion and personality recognition. | egocentric, egocentric vision | | |
🔬 Pillar 9: Embodied Foundation Models (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Mitigating Coordinate Prediction Bias from Positional Encoding Failures | Targets coordinate prediction bias in MLLMs and proposes Vision-PE Shuffle Guidance to improve localization accuracy. | large language model, multimodal | | |
| 15 | HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models | Proposes HARMONY, which uses hidden-layer activations together with model outputs to improve uncertainty estimation for vision-language models. | multimodal | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | MOGRAS: Human Motion with Grasping in 3D Scenes | MOGRAS: proposes a large-scale dataset of human grasping-interaction motion in 3D scenes, along with a baseline method. | physically plausible, human-scene interaction | | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation | GRAID: enhances the spatial reasoning of vision-language models through high-fidelity data generation. | spatial relationship | | |