| 1 |
Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning |
Brain3D: EEG-to-3D decoding of visual representations via multimodal reasoning |
large language model multimodal |
|
|
| 2 |
EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience |
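Proposes the EEG2Vision framework, which uses low-density EEG signals to achieve high-quality visual reconstruction and improves the application potential of brain-computer interfaces. |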
large language model multimodal |
|
|
| 3 |
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models |
HAWK: head importance-aware visual token pruning in multimodal models |
large language model multimodal |
✅ |
|
| 4 |
Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts |
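Reveals the "seeing but not thinking" phenomenon in multimodal MoE models and proposes a routing-guided intervention method to improve visual reasoning. |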
multimodal |
|
|
| 5 |
SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation |
Proposes SyncBreaker, a multimodal adversarial attack framework targeting audio-driven talking head generation. |
multimodal |
✅ |
|
| 6 |
DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection |
Proposes DBMF, a dual-branch multimodal framework for improving OOD detection performance in medical imaging. |
multimodal |
|
|
| 7 |
$\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization |
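Proposes SCADA-UL, which uses adversarial optimization to address the leakage of source-domain information in source-free domain adaptation |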
zero-shot transfer |
✅ |
|
| 8 |
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment |
Proposes PaveGPT, which achieves comprehensive automated pavement condition assessment through domain-specific instruction tuning |
foundation model |
|
|
| 9 |
DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather |
DinoRADE: full spectral radar-camera fusion using vision foundation model features for multi-class object detection in adverse weather |
foundation model |
✅ |
|
| 10 |
Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images |
Uses a pretrained DINOv3 for annotation-efficient adnexal mass segmentation in cine images |
foundation model |
✅ |
|
| 11 |
Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models |
Proposes LogitProd, a plug-and-play logit fusion method for pathology foundation models that improves downstream task performance. |
foundation model |
|
|
| 12 |
Weight Group-wise Post-Training Quantization for Medical Foundation Model |
A weight group-wise post-training quantization method for medical foundation models that improves inference speed on end devices |
foundation model |
|
|
| 13 |
AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation |
AVGen-Bench: a task-driven benchmark for multi-granular evaluation of text-to-audio-video generation |
large language model multimodal |
|
|
| 14 |
What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric |
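Proposes a vision-language-model-based framework for evaluating semantic scanpath similarity, covering the semantic information that traditional methods ignore. |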
foundation model multimodal |
|
|
| 15 |
SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection |
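SciFigDetect: the first benchmark for detecting AI-generated scientific figures, revealing the shortcomings of existing detection methods on scientific images. |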
multimodal zero-shot transfer |
✅ |
|
| 16 |
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding |
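Proposes Bridge-STG, which decouples spatio-temporal alignment to improve the performance of multimodal large language models on video grounding. |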
large language model multimodal |
|
|
| 17 |
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation |
Proposes Tarot-SAM3, a training-free SAM3 framework for segmentation from any referring expression. |
large language model multimodal |
|
|
| 18 |
AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models |
AgriChain: visually grounded, expert-verified reasoning for interpretable agricultural vision-language models |
multimodal chain-of-thought |
✅ |
|
| 19 |
ParseBench: A Document Parsing Benchmark for AI Agents |
Proposes ParseBench to address semantic correctness in document parsing |
visual grounding |
✅ |
|
| 20 |
Phantasia: Context-Adaptive Backdoors in Vision Language Models |
Proposes Phantasia, a context-adaptive backdoor attack method for vision-language models |
multimodal |
|
|
| 21 |
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models |
PokeGym: a visually driven benchmark for evaluating vision-language models on long-horizon tasks. |
visual grounding |
|
|
| 22 |
Revisiting Radar Perception With Spectral Point Clouds |
Proposes spectral point clouds to improve the generalization of radar perception models across different sensors. |
foundation model |
|
|
| 23 |
DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning |
Proposes DiffVC, a diffusion-based non-autoregressive framework for video captioning |
multimodal |
|
|
| 24 |
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding |
AdaSpark: an adaptive sparsity framework for efficient long-video understanding |
large language model |
|
|
| 25 |
Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments |
Proposes the FI3Det framework, which uses vision-language models for few-shot incremental 3D object detection in dynamic indoor environments. |
multimodal |
✅ |
|
| 26 |
PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation |
PanoSAM2: lightweight distortion- and memory-aware adaptations of SAM2 for 360 video object segmentation |
embodied AI |
|
|
| 27 |
RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs |
Proposes RemoteAgent, which fine-tunes agentic MLLMs with reinforcement learning to address the understanding of vague intents in remote sensing. |
large language model |
|
|
| 28 |
Unified Multimodal Uncertain Inference |
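Proposes UMUI, a unified multimodal uncertain inference framework that addresses the challenge of probabilistically calibrated inference across modalities. |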
multimodal |
|
|
| 29 |
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments |
MARINER: a 3E-driven benchmark for fine-grained perception and complex reasoning in open-water environments |
large language model multimodal |
✅ |
|
| 30 |
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding |
Proposes 3D-VCD, which mitigates hallucination in 3D embodied agents through visual contrastive decoding |
multimodal |
|
|
| 31 |
Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring |
Proposes LeanGate, which accelerates Transformer-based monocular SLAM through geometric utility scoring |
foundation model |
|
|