| # | Title | Summary | Keywords | ✅ |
|---|-------|---------|----------|----|
| 1 | FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology | Evaluates the capabilities of multimodal large language models in marine biology and builds a large-scale multimodal benchmark. | large language model, multimodal | |
| 2 | MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment | Proposes the MMRQA framework, combining signal processing with multimodal large language models for MRI quality assessment. | large language model, multimodal | |
| 3 | Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation | Proposes the ViPET-ReportGen dataset and benchmark to improve vision-language foundation models for Vietnamese PET/CT report generation. | foundation model, multimodal | ✅ |
| 4 | LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models | LLM-RG: referential grounding of objects in outdoor scenarios using large language models. | large language model, chain-of-thought | |
| 5 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | GHOST: a hallucination-inducing image generation method for stress-testing multimodal LLMs. | large language model, multimodal | |
| 6 | Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding | Proposes Layer Contrastive Decoding (LayerCD) to mitigate hallucination in multimodal large language models. | large language model, multimodal | ✅ |
| 7 | OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding | Proposes the OIG-Bench benchmark for evaluating how well multimodal large language models understand one-image guides. | large language model, multimodal | ✅ |
| 8 | Vision Function Layer in Multimodal LLMs | Identifies vision function layers in multimodal LLMs, enabling efficient and customizable visual capabilities. | large language model, multimodal | |
| 9 | Multimodal Arabic Captioning with Interpretable Visual Concept Integration | VLCAP: a multimodal Arabic image captioning framework with interpretable visual concept integration. | multimodal | |
| 10 | VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning | VideoAnchor: achieves coherent visual-spatial reasoning by reinforcing subspace-structured visual cues. | large language model, multimodal, visual grounding | ✅ |
| 11 | A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration | Proposes the FFDP framework, enabling multimodal image registration at an unprecedented gigavoxel scale. | multimodal | |
| 12 | Robust Multimodal Semantic Segmentation with Balanced Modality Contributions | Proposes EQUISeg, improving the robustness of multimodal semantic segmentation by balancing modality contributions. | multimodal | |
| 13 | Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models | Proposes Uni-X, a two-end-separated architecture that mitigates modality conflict in unified multimodal models. | multimodal | ✅ |
| 14 | Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection | Proposes the Forensic-Chat framework, improving the generalization and explainability of multimodal large language models for fake image detection. | large language model, multimodal | |
| 15 | PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images | PixelCraft: a multi-agent system for high-fidelity visual reasoning on structured images. | large language model, multimodal | ✅ |
| 16 | VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | Proposes the VT-FSL framework, bridging vision and text with LLMs to improve few-shot learning performance. | large language model, multimodal | ✅ |
| 17 | Environment-Aware Satellite Image Generation with Diffusion Models | Proposes an environment-aware diffusion model for generating high-quality, environment-consistent satellite images. | foundation model, multimodal | |
| 18 | FreeRet: MLLMs as Training-Free Retrievers | FreeRet: leverages MLLMs as training-free retrievers for strong multimodal retrieval. | large language model, multimodal | |
| 19 | Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks | Proposes the Euclid30K dataset and fine-tunes vision-language models on it, markedly improving their spatial perception and reasoning. | large language model, multimodal | ✅ |
| 20 | UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark | UI2V-Bench: an understanding-based benchmark for image-to-video generation, focused on semantic understanding and reasoning. | large language model, multimodal | |
| 21 | VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models | VISOR++: a method for steering the behavior of large vision-language models via universal visual inputs. | multimodal | |
| 22 | Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents | CogniGPT: interactive multi-granular clue exploration for more efficient and reliable long-video understanding. | large language model | |
| 23 | Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models | Proposes a training-free token-pruning method to reduce the inference cost of vision-language models. | multimodal | |
| 24 | Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency | Proposes QL-Adapter, addressing quantity and layout consistency in multi-object image editing. | instruction following | |