| # | Title | Summary | Keywords | ✅ |
|---|-------|---------|----------|----|
| 1 | MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues | Proposes MT-Video-Bench for evaluating the video understanding capability of multimodal LLMs in multi-turn dialogues. | large language model, multimodal | |
| 2 | $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs | VisiPruner: decodes discontinuous cross-modal dynamics in multimodal LLMs to enable efficient pruning. | large language model, multimodal | ✅ |
| 3 | Towards a Generalizable Fusion Architecture for Multimodal Object Detection | Proposes the FMCAF architecture to improve the generalization and robustness of multimodal object detection. | multimodal | |
| 4 | Glyph: Scaling Context Windows via Visual-Text Compression | Glyph: scales large language model context windows via visual-text compression. | large language model, multimodal | ✅ |
| 5 | Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention | Proposes Xihe, built on Hierarchical Interleaved Block Attention (HIBA), for scalable zero-shot time series learning. | foundation model, zero-shot transfer | |
| 6 | iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA | Proposes iDETEX, empowering multimodal LLMs for intelligent, detailed, and explainable image quality assessment. | large language model, multimodal | |
| 7 | SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference | SparseVILA: decouples visual sparsity for efficient VLM inference. | multimodal | |
| 8 | Elastic ViTs from Pretrained Models without Retraining | Proposes SnapViT, which obtains elastic compute from pretrained ViT models without retraining. | foundation model | |
| 9 | ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input | ImaGGen: zero-shot generation of co-speech semantic gestures grounded in language and image input. | multimodal | ✅ |
| 10 | Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization | Proposes a context-aware pseudo-label scoring framework for zero-shot video summarization that improves LLM performance on the task. | large language model | |
| 11 | Monitoring Horses in Stalls: From Object to Event Detection | Proposes a YOLOv11- and BoT-SORT-based monitoring system for horses in stalls, enabling automatic event detection. | foundation model | |
| 12 | Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | Proposes a recurrent attention-based token selection method for efficient streaming video-LLMs. | large language model | |
| 13 | Exploring The Missing Semantics In Event Modality | Proposes Semantic-E2VID, which leverages visual semantic knowledge to enhance event-to-video reconstruction. | foundation model | |