| # | Title | Summary | Tags | Picked |
|---|-------|---------|------|--------|
| 1 | MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models | Proposes MASQuant to address modality misalignment and cross-modal computational invariance in multimodal LLM quantization. | large language model, multimodal | ✅ |
| 2 | Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary | Evaluates GPT-5's capabilities as a multimodal clinical reasoner: a landscape commentary. | foundation model, multimodal, chain-of-thought | |
| 3 | Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models | Proposes a complexity-aware adaptive inference framework that improves the efficiency and reliability of VLA models on complex tasks. | vision-language-action, VLA | |
| 4 | NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries | NaiLIA: multimodal nail design retrieval based on dense intent descriptions and palette queries. | foundation model, multimodal | |
| 5 | UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark | UniM: a unified any-to-any interleaved multimodal benchmark for advancing multimodal large language models. | large language model, multimodal | ✅ |
| 6 | Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild | Evaluates the reliability of multimodal LLMs for zero-shot anomaly detection in surveillance settings and reveals their conservative bias. | large language model, multimodal | |
| 7 | Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline | Proposes the MM-Lifelong dataset and the ReMA model to tackle the memory bottleneck and grounding collapse in multimodal lifelong understanding. | multimodal | |
| 8 | Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model | Tell2Adapt: a unified framework for source-free unsupervised domain adaptation via vision foundation models. | foundation model | ✅ |
| 9 | VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters | Proposes VisionPangu, a compact 1.7B-parameter multimodal assistant with improved fine-grained image description. | multimodal | |
| 10 | Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation | Proposes the FedMEPD framework to address modality heterogeneity and personalized modeling in multimodal brain tumor segmentation. | multimodal | |
| 11 | Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models | Proposes MPCAttack, a multi-paradigm collaborative adversarial attack that improves the transferability of adversarial examples against multimodal LLMs. | large language model | ✅ |
| 12 | Revisiting Shape from Polarization in the Era of Vision Foundation Models | With high-quality polarization data and domain adaptation, a lightweight model outperforms vision foundation models on single-view surface normal estimation. | foundation model | |
| 13 | HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token | HALP: detects hallucinations in vision-language models without generating a single token. | multimodal | |
| 14 | Layer by layer, module by module: Choose both for optimal OOD probing of ViT | Proposes layer- and module-selective OOD probing for ViT, optimizing performance under distribution shift. | foundation model | |
| 15 | A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset | Proposes a color-attention RT-DETR for blue emergency vehicle light detection with a 360-degree multi-camera system. | multimodal | |
| 16 | MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents | MultiHaystack: a large-scale multimodal retrieval and reasoning benchmark that probes MLLM performance bottlenecks in complex scenarios. | large language model, multimodal | |
| 17 | Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection | Proposes PFS, a post-fusion stabilizer that improves the robustness of multimodal 3D detection under domain shift and sensor failure. | multimodal | |