| 1 |
TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval |
Proposes TRACE, which achieves universal multimodal retrieval through task-adaptive reasoning and representation learning. |
representation learning large language model multimodal |
|
|
| 2 |
Chain of World: World Model Thinking in Latent Motion |
Proposes the Chain-of-World VLA model to address visual dynamics prediction and temporal causal modeling in embodied intelligence. |
world model latent dynamics motion latent |
✅ |
|
| 3 |
VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning |
Proposes VSearcher, which uses reinforcement learning to give multimodal models long-horizon, multi-turn web search capability. |
reinforcement learning large language model multimodal |
|
|
| 4 |
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization |
Proposes MoD-DPO, which mitigates cross-modal hallucinations in omni-modal LLMs through modality-decoupled preference optimization. |
DPO direct preference optimization large language model |
|
|
| 5 |
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation |
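Proposes Generalizable Knowledge Distillation (GKD) to improve the cross-domain generalization of semantic segmentation models. |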
representation learning distillation foundation model |
✅ |
|
| 6 |
Beyond Language Modeling: An Exploration of Multimodal Pretraining |
Explores multimodal pretraining beyond language modeling, aiming for synergy between vision and language. |
world model foundation model multimodal |
|
|
| 7 |
Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting |
MVD-HuGaS: single-image 3D human reconstruction based on a multi-view diffusion model and Gaussian splatting. |
distillation gaussian splatting splatting |
|
|
| 8 |
Kling-MotionControl Technical Report |
Kling-MotionControl: a unified DiT-based framework for robust, precise, and expressive character animation. |
distillation motion retargeting motion representation |
|
|
| 9 |
Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective |
Proposes IB-IUMAD to address catastrophic forgetting in incremental unified multimodal anomaly detection. |
Mamba multimodal |
|
|
| 10 |
SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data |
Proposes the SGMA framework to address modality imbalance in semantic segmentation of incomplete multimodal remote sensing data. |
contrastive learning multimodal |
|
|
| 11 |
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing |
Proposes RL3DEdit, which achieves multi-view consistent 3D scene editing through geometry-guided reinforcement learning. |
reinforcement learning VGGT foundation model |
|
|
| 12 |
CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration |
Proposes CAWM-Mamba, a unified model for infrared-visible image fusion and compound adverse weather restoration. |
Mamba SSM multimodal |
✅ |
|
| 13 |
Specificity-aware reinforcement learning for fine-grained open-world classification |
Proposes SpeciaRL to address overly generic predictions by LMMs in fine-grained open-world classification. |
reinforcement learning multimodal |
✅ |
|
| 14 |
From "What" to "How": Constrained Reasoning for Autoregressive Image Generation |
Proposes CoR-Painter, which guides autoregressive image generation with constrained reasoning to resolve spatial ambiguity. |
reinforcement learning spatial relationship chain-of-thought |
|
|
| 15 |
ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling |
ShareVerse: a multi-agent consistent video generation framework for shared world modeling. |
world model geometric consistency |
|
|
| 16 |
DREAM: Where Visual Understanding Meets Text-to-Image Generation |
DREAM: a unified framework that combines visual understanding with text-to-image generation. |
representation learning depth estimation multimodal |
|
|
| 17 |
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion |
ITO: achieves a unified representation of images and text by synergizing multiple alignment and training-time fusion. |
representation learning contrastive learning multimodal |
|
|
| 18 |
NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining |
NeighborMAE: masked autoencoder pretraining that exploits spatial dependencies between neighboring remote sensing images. |
masked autoencoder |
|
|