| 1 |
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning |
Proposes UniVG-R1 to address complex multimodal visual grounding |
reinforcement learning large language model multimodal |
✅ |
|
| 2 |
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation |
Proposes UniGen to address the challenges of multimodal understanding and generation |
direct preference optimization large language model multimodal |
|
|
| 3 |
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning |
Proposes Visionary-R1 to address shortcut learning in visual reasoning |
reinforcement learning large language model multimodal |
|
|
| 4 |
Programmatic Video Prediction Using Large Language Models |
Proposes ProgGen to address video frame prediction |
world model large language model |
|
|
| 5 |
Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency |
Proposes TemRobBench and PanoDPO to address temporal consistency in multimodal models |
direct preference optimization multimodal |
|
|
| 6 |
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank |
Proposes VisualQuality-R1 to address insufficient reasoning in image quality assessment |
reinforcement learning large language model |
|
|
| 7 |
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning |
Proposes DeepEyes to address the integration of vision and text in multimodal reasoning |
reinforcement learning multimodal |
✅ |
|
| 8 |
Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method |
Proposes the OmniVQA dataset and the 360-R1 method to address panoramic visual question answering |
reinforcement learning embodied AI large language model |
|
|
| 9 |
StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning |
Proposes the StPR framework to address forgetting in video class-incremental learning |
distillation spatiotemporal |
|
|
| 10 |
Intra-class Patch Swap for Self-Distillation |
Proposes a self-distillation method based on intra-class patch swapping to simplify knowledge distillation |
teacher-student distillation |
✅ |
|
| 11 |
MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks |
Proposes MultiMAE to address pre-training for multimodal Earth observation tasks |
masked autoencoder |
✅ |
|
| 12 |
RETRO: REthinking Tactile Representation Learning with Material PriOrs |
Proposes material priors to improve the accuracy of tactile representation learning |
representation learning |
|
|
| 13 |
Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search |
Proposes a symbolic graph ranker to address the modeling of information structure in session search |
contrastive learning large language model |
|
|
| 14 |
Scaling Vision Mamba Across Resolutions via Fractal Traversal |
Proposes FractalMamba++ to address adaptability to varying visual input resolutions |
Mamba |
|
|
| 15 |
Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning |
Proposes physics-driven local-whole elastic deformation modeling to improve point cloud representation learning |
representation learning |
|
|
| 16 |
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels |
Proposes Ground-V to address pixel-level grounding of complex instructions |
distillation instruction following |
|
|