| 1 |
QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models |
Proposes QAPruner, a quantization-aware visual token pruning method for multimodal large language models.
large language model multimodal |
|
|
| 2 |
ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization |
Proposes ForgeryGPT for interpretable image forgery detection and localization, with support for interactive dialogue.
large language model multimodal instruction following |
|
|
| 3 |
Multimodal Language Models Cannot Spot Spatial Inconsistencies |
Proposes a multi-view spatial-consistency evaluation method, revealing MLLMs' deficiencies in 3D reasoning.
large language model multimodal |
|
|
| 4 |
Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs |
Proposes Guideline2Graph, which parses clinical guidelines into executable clinical decision graphs and significantly improves parsing accuracy.
multimodal |
|
|
| 5 |
Token-Efficient Multimodal Reasoning via Image Prompt Packaging |
Proposes an image prompt packaging method to reduce the cost of multimodal reasoning.
multimodal |
|
|
| 6 |
Rapidly deploying on-device eye tracking by distilling visual foundation models |
DistillGaze: rapidly deployable on-device eye tracking by distilling visual foundation models.
foundation model |
|
|
| 7 |
Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery |
Smart Transfer: leveraging a vision foundation model for rapid building damage mapping from post-earthquake VHR imagery.
foundation model |
|
|
| 8 |
MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications |
MOMO: a multi-sensor-fusion Mars orbital model for Mars orbital applications.
foundation model |
|
|
| 9 |
MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling |
MMPhysVideo: improving physical plausibility in video generation via joint multimodal modeling.
multimodal |
|
|
| 10 |
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models |
Proposes a visual-explanation method for geometry education based on procedural geometry data generation and vision-language models.
visual grounding |
|
|
| 11 |
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment |
Proposes ConFu, a contrastive fusion framework for capturing complex dependencies in higher-order multimodal alignment.
multimodal |
|
|
| 12 |
EGM: Efficient Visual Grounding Language Models |
Proposes EGM: improving the efficiency of small vision-language models on visual grounding tasks by generating more medium-quality tokens.
visual grounding |
|
|
| 13 |
MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models |
MuRF: unlocking the multi-scale potential of vision foundation models to improve inference performance.
foundation model |
|
|
| 14 |
Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs |
Efficient3D: a unified framework for adaptive and debiased token reduction in 3D MLLMs.
large language model multimodal |
|
|
| 15 |
Token Warping Helps MLLMs Look from Nearby Viewpoints |
Token Warping: improving multimodal large language models' reasoning ability under viewpoint changes.
large language model multimodal |
|
|
| 16 |
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding |
Proposes ProVCA, an MLLM-agent-based progressive video condensation method for long-form video understanding.
large language model multimodal |
|
|
| 17 |
SentiAvatar: Towards Expressive and Interactive Digital Humans |
SentiAvatar: a framework for building expressive and interactive digital humans.
foundation model multimodal |
|
|
| 18 |
PolyReal: A Benchmark for Real-World Polymer Science Workflows |
PolyReal: a multimodal large language model benchmark for real-world polymer science workflows.
large language model multimodal |
|
|
| 19 |
MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs |
MI-Pruner: a cross-modal mutual-information-guided visual token pruning method for more efficient MLLMs.
large language model multimodal |
|
|
| 20 |
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models |
CoDA: exploring chain-of-distribution attacks and post-hoc token-space repair for medical vision-language models.
large language model multimodal |
|
|
| 21 |
Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation |
Proposes an Inertia-aware Visual Excitation method to mitigate cognitive hallucinations in multimodal large language models.
large language model multimodal |
|
|
| 22 |
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors |
Vision-language models over-rely on semantic anchors and ignore visual detail, limiting their visual reasoning ability.
multimodal |
|
|
| 23 |
EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors |
EnsemHalDet: robust VLM hallucination detection via an ensemble of internal state detectors.
multimodal |
|
|
| 24 |
SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection |
Proposes SPG, sparse-projected guides built on sparse autoencoders, for zero-shot anomaly detection.
foundation model |
|
|
| 25 |
CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation |
Proposes CrossWeaver, a cross-modal fusion framework for arbitrary-modality semantic segmentation.
multimodal |
|
|
| 26 |
QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection |
Proposes the QVAD framework to address the static-query problem in video anomaly detection.
foundation model |
|
|
| 27 |
A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification |
Proposes a data-augmentation-based ViT baseline for SAR sea ice classification, improving recognition accuracy for rare ice classes.
multimodal |
|
|
| 28 |
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models |
Proposes UCGP, a universal physical adversarial patch framework targeting infrared vision-language models.
multimodal |
|
|
| 29 |
EffiMiniVLM: A Compact Dual-Encoder Regression Framework |
Proposes EffiMiniVLM, a compact dual-encoder regression framework for product quality prediction in cold-start scenarios.
multimodal |
|
|
| 30 |
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes |
Proposes GSAM, which efficiently fine-tunes SAM via random cropping to handle variable input image sizes.
foundation model |
|
|
| 31 |
Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding |
Proposes a latent-space anonymization adapter module for privacy-preserving video understanding.
foundation model |
|
|
| 32 |
SAGA: Source Attribution of Generative AI Videos |
SAGA: the first source-attribution framework for generative AI videos, enabling multi-granularity model attribution and interpretability analysis.
foundation model |
|
|
| 33 |
Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation |
Proposes an efficient test-time depth completion method via low-rank decoder adaptation.
foundation model |
|
|
| 34 |
When Negation Is a Geometry Problem in Vision-Language Models |
Proposes a representation-engineering-based test-time intervention to improve CLIP's understanding of textual negation.
multimodal |
|
|
| 35 |
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks |
Proposes random-label bridging training to effectively transfer language models to vision tasks.
large language model |
|
|
| 36 |
Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance |
Reveals the fragility of vision-language models under geometric transformations, challenging their visual invariance.
multimodal |
|
|