| 1 |
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models |
Proposes Impromptu VLA to address the challenges facing vision-language-action models in autonomous driving |
vision-language-action VLA |
✅ |
|
| 2 |
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought |
Proposes Argus to address insufficient attention in visual reasoning |
large language model multimodal chain-of-thought |
✅ |
|
| 3 |
Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model |
Proposes a preemptive hallucination reduction method to address hallucination in multimodal language models |
large language model multimodal |
|
|
| 4 |
OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation |
Proposes OpenUni to unify multimodal understanding and generation |
large language model multimodal |
✅ |
|
| 5 |
MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking |
Proposes MaskAdapt to address unsupervised domain adaptation in agriculture |
multimodal |
|
|
| 6 |
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence |
Proposes Spatial-MLLM to address visual-based spatial intelligence |
large language model foundation model multimodal |
✅ |
|
| 7 |
FMG-Det: Foundation Model Guided Robust Object Detection |
Proposes FMG-Det to address object detection under noisy annotations |
foundation model |
|
|
| 8 |
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos |
Proposes VF-Eval to evaluate the ability of multimodal LLMs to generate feedback on AIGC videos |
multimodal |
|
|
| 9 |
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis |
Proposes EndoBench to address the lack of evaluation for multimodal models in endoscopy analysis |
large language model |
|
|
| 10 |
OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data |
Proposes OmniEarth-Bench to address the evaluation of Earth's six spheres and their interactions |
multimodal |
|
|
| 11 |
VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning |
Proposes VAU-R1 to address insufficient reasoning ability in video anomaly understanding |
large language model multimodal chain-of-thought |
✅ |
|
| 12 |
MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification |
Proposes MCFNet to address fine-grained semantic classification in multimodal information fusion |
multimodal |
|
|
| 13 |
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? |
Proposes VideoReasonBench to address complex reasoning in video understanding |
large language model multimodal chain-of-thought |
|
|
| 14 |
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks |
Proposes ThinkGeo to evaluate tool-augmented agents on remote sensing tasks |
large language model multimodal |
|
|
| 15 |
Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications |
Proposes a metadata enrichment model to address insufficient metadata in cultural heritage digitization |
large language model TAMP |
|
|
| 16 |
CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection |
Proposes the CMIE framework to address the shortcomings of multimodal large language models in misinformation detection |
large language model multimodal |
|
|
| 17 |
Vid-SME: Membership Inference Attacks against Large Video Understanding Models |
Proposes Vid-SME to address membership inference attacks on video understanding models |
large language model multimodal |
|
|
| 18 |
DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment |
Proposes DGIQA to address generalization in no-reference image quality assessment |
multimodal |
|
|
| 19 |
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL |
Proposes VisualSphinx to address the shortage of training data for vision-language models |
multimodal |
|
|
| 20 |
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding |
Proposes the ScaleLong benchmark to address multi-timescale challenges in long video understanding |
multimodal |
✅ |
|
| 21 |
D-AR: Diffusion via Autoregressive Models |
Proposes D-AR to recast the image diffusion process as an autoregressive model |
large language model |
✅ |
|
| 22 |
ZeroSep: Separate Anything in Audio with Zero Training |
Proposes ZeroSep to achieve training-free audio source separation |
foundation model |
|
|
| 23 |
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition |
Proposes Uni-MuMER to address handwritten mathematical expression recognition |
chain-of-thought |
✅ |
|
| 24 |
TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models |
Proposes TerraIncognita to address the challenge of insect species discovery |
multimodal |
✅ |
|
| 25 |
VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation |
Proposes VCapsBench to address inadequate evaluation of video caption quality |
large language model |
✅ |
|