| 11 |
VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models |
Proposes VESSA, a video-based, object-centric self-supervised adaptation method for visual foundation models |
distillation foundation model |
✅ |
|
| 12 |
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence |
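Conan: proposes a progressive learning framework based on multi-scale visual evidence that improves the performance of multimodal large language models on video reasoning tasks. |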
reinforcement learning large language model multimodal |
|
|
| 13 |
A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development |
For brain MRI foundation models, the paper systematically evaluates the diversity and consistency issues of public datasets. |
representation learning foundation model |
|
|
| 14 |
GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs |
GranViT: a fine-grained vision model for MLLMs that improves performance through autoregressive perception |
distillation large language model multimodal |
|
|
| 15 |
Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs |
Proposes the WM-MoE framework, which uses a world model and a mixture of experts to address corner cases in autonomous driving |
world model large language model |
|
|
| 16 |
Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection |
Proposes the CURL framework, which uses contrastive learning to detect fetal movement in fetal ultrasound videos. |
representation learning contrastive learning |
|
|
| 17 |
Generative Point Tracking with Flow Matching |
Proposes GenPT, a generative point tracker based on Flow Matching, to address multimodal trajectory prediction under visual occlusion. |
flow matching |
|
|
| 18 |
TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge |
TernaryCLIP: efficiently compresses vision-language models through ternary weights and knowledge distillation |
distillation multimodal |
|
|
| 19 |
IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks |
Proposes IB-GAN, which uses the information bottleneck to improve disentangled representation learning in GANs. |
representation learning |
|
|
| 20 |
TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning |
Proposes TOMCAT, which addresses distribution shift in compositional zero-shot learning through test-time knowledge accumulation. |
representation learning multimodal |
✅ |
|