| 1 |
COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation |
COCO-Urdu: builds a large-scale Urdu image-caption dataset and proposes a multimodal quality-estimation framework.
large language model, multimodal, visual grounding
|
|
| 2 |
Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles |
Proposes MMB, a multimodal Bayesian prompt-ensemble method for calibrating MLLM-as-a-judge bias in text-to-image generation evaluation (sketch below).
large language model, multimodal
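
To make the idea concrete, here is a minimal sketch of a Bayesian prompt ensemble for an MLLM judge: several judge-prompt templates score the same items, a posterior weight over templates is fitted on a small labeled calibration set, and new items are scored by the posterior-weighted average. The helper names (`posterior_weights`, `ensemble_score`) and the Bernoulli likelihood are illustrative assumptions, not the paper's MMB formulation.

```python
# Sketch of a Bayesian prompt ensemble for an MLLM judge (not the paper's exact method).
import numpy as np

def posterior_weights(judge_probs, labels, prior=None):
    """judge_probs: (n_prompts, n_calib) judge-predicted P(label=1) on a small
    labeled calibration set; labels: (n_calib,) 0/1 ground truth."""
    n_prompts = judge_probs.shape[0]
    log_prior = np.log(np.full(n_prompts, 1.0 / n_prompts) if prior is None else np.asarray(prior))
    eps = 1e-6
    p = np.clip(judge_probs, eps, 1 - eps)
    # Bernoulli log-likelihood of the calibration labels under each prompt template.
    log_lik = (labels * np.log(p) + (1 - labels) * np.log(1 - p)).sum(axis=1)
    log_post = log_prior + log_lik
    log_post -= log_post.max()            # subtract max for numerical stability
    w = np.exp(log_post)
    return w / w.sum()

def ensemble_score(per_prompt_scores, weights):
    """Posterior-weighted average of per-prompt judge scores for a new item."""
    return float(np.dot(weights, per_prompt_scores))

# Toy usage with random numbers standing in for real judge outputs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.uniform(size=(4, 20))       # 4 prompt templates, 20 calibration items
    labels = rng.integers(0, 2, size=20)
    w = posterior_weights(probs, labels)
    print(w, ensemble_score(rng.uniform(size=4), w))
```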
|
|
| 3 |
Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation |
Proposes a vision-language semantic aggregation method built on EM-based aggregation and text-guided decoding to improve the generalization of medical image segmentation (sketch below).
foundation model, multimodal
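
As a rough illustration of EM-style feature aggregation (in the spirit of EM attention, not necessarily the paper's algorithm): pixel features are soft-assigned to a small set of semantic bases (E-step), the bases are re-estimated as weighted means (M-step), and aggregated features are reconstructed from the bases. The function name `em_aggregate`, the temperature, and the use of text embeddings as initial bases are assumptions for illustration.

```python
# Generic EM-style aggregation of visual features onto semantic bases.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def em_aggregate(X, bases, n_iters=3, temperature=0.05):
    """X: (n_pixels, d) visual features; bases: (k, d) initial semantic bases
    (e.g., text embeddings of class prompts)."""
    for _ in range(n_iters):
        # E-step: soft-assign each pixel feature to the k bases.
        resp = softmax(X @ bases.T / temperature, axis=1)            # (n_pixels, k)
        # M-step: update bases as responsibility-weighted means of the features.
        bases = (resp.T @ X) / (resp.sum(axis=0, keepdims=True).T + 1e-6)
        bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-6
    # Reconstruct compact, semantically aggregated features from the bases.
    X_agg = resp @ bases                                             # (n_pixels, d)
    return X_agg, bases

if __name__ == "__main__":
    X = np.random.randn(1024, 64)     # 1024 pixel features of dimension 64
    B = np.random.randn(8, 64)        # 8 initial semantic bases
    X_agg, bases = em_aggregate(X, B)
    print(X_agg.shape, bases.shape)   # (1024, 64) (8, 64)
```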
|
|
| 4 |
MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance |
Introduces MITS, a large-scale multimodal benchmark dataset for intelligent traffic surveillance, improving LMM performance in the ITS domain.
multimodal, instruction following
|
|
| 5 |
An Open Benchmark Dataset for GeoAI Foundation Models for Oil Palm Mapping in Indonesia |
Releases an open benchmark dataset for GeoAI foundation models for oil palm mapping in Indonesia, in support of sustainable development.
foundation model, PaLM-E
|
|
| 6 |
Recurrence Meets Transformers for Universal Multimodal Retrieval |
Proposes ReT-2, a universal multimodal retrieval model that supports multimodal queries (sketch below).
multimodal |
✅ |
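
A toy sketch of retrieval with multimodal queries, ignoring ReT-2's actual recurrent-transformer architecture: the image and text parts of a query are fused into one embedding and candidates are ranked by cosine similarity. `fuse_query`, the convex-combination fusion, and the unit-normalized encoders are assumptions for illustration.

```python
# Multimodal-query retrieval by cosine similarity (illustration only).
import numpy as np

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def fuse_query(img_emb, txt_emb, alpha=0.5):
    """Simple convex combination of image and text query embeddings."""
    return l2_normalize(alpha * img_emb + (1 - alpha) * txt_emb)

def retrieve(query_emb, index_embs, top_k=5):
    """index_embs: (n_items, d) unit-normalized candidate embeddings."""
    scores = index_embs @ query_emb            # cosine similarity against the index
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    index = l2_normalize(rng.normal(size=(1000, 128)), axis=1)
    q = fuse_query(l2_normalize(rng.normal(size=128)), l2_normalize(rng.normal(size=128)))
    ids, scores = retrieve(q, index)
    print(ids, scores)
```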
|
| 7 |
Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis |
Proposes retrieval-augmented vision-language models to improve the accuracy of multimodal melanoma diagnosis.
multimodal |
|
|
| 8 |
A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval |
Proposes a multimodal RAG framework for post-disaster housing damage assessment that jointly optimizes image encoding and policy vector retrieval (sketch below).
multimodal |
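
A minimal retrieve-then-read sketch of such a multimodal RAG loop, assuming three hypothetical helpers (`embed_image`, `embed_text`, `call_mllm`); it does not reproduce the paper's collaborative optimization, only the basic pipeline of retrieving policy clauses for a damage photo and prompting an MLLM with them.

```python
# Multimodal RAG sketch: retrieve policy clauses for an image, then ask an MLLM.
import numpy as np

def build_policy_index(policy_clauses, embed_text):
    """Embed each policy clause once; returns (clauses, unit-norm embedding matrix)."""
    embs = np.stack([embed_text(c) for c in policy_clauses])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True) + 1e-12
    return policy_clauses, embs

def assess_damage(image_path, clauses, clause_embs, embed_image, call_mllm, top_k=3):
    q = embed_image(image_path)
    q /= np.linalg.norm(q) + 1e-12
    top = np.argsort(-(clause_embs @ q))[:top_k]      # retrieve most relevant clauses
    context = "\n".join(f"- {clauses[i]}" for i in top)
    prompt = (
        "You are assessing post-disaster housing damage.\n"
        f"Relevant policy clauses:\n{context}\n"
        "Describe the damage in the image and state which clauses apply."
    )
    return call_mllm(image_path, prompt)
```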
|
|
| 9 |
BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion |
Proposes BcQLM, a lightweight MLLM framework built on BreezeCLIP for efficient vision-language understanding (sketch below).
large language model, multimodal
✅ |
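
One plausible reading of "Q-gated cross-modal fusion", sketched as a generic gated cross-attention block in PyTorch; this is not BcQLM's actual module, and the gating placement is an assumption. Query tokens attend over visual tokens, and a sigmoid gate conditioned on the queries modulates the fused signal before the residual update.

```python
# Generic gated cross-attention fusion block (illustrative, not BcQLM's code).
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)     # per-query gate on the fused signal
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, visual_tokens):
        # queries: (B, Nq, D) learnable query tokens; visual_tokens: (B, Nv, D).
        fused, _ = self.attn(queries, visual_tokens, visual_tokens)
        g = torch.sigmoid(self.gate(queries))     # gate conditioned on the queries
        return self.norm(queries + g * fused)     # gated residual update

# Tiny usage example with random tensors.
if __name__ == "__main__":
    block = GatedCrossModalFusion(dim=256)
    q = torch.randn(2, 32, 256)       # 32 query tokens
    v = torch.randn(2, 196, 256)      # 196 visual patch tokens
    print(block(q, v).shape)          # torch.Size([2, 32, 256])
```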
|
| 10 |
AdsQA: Towards Advertisement Video Understanding |
Proposes AdsQA, a benchmark for advertisement video understanding, and designs the ReAd-R model to improve LLM capabilities in the advertising domain.
large language model |
|
|