| 1 |
Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior |
Proposes the Pathology-CoT framework, which learns a visual chain-of-thought agent from expert WSI diagnosis behavior |
foundation model chain-of-thought |
|
|
| 2 |
ActiveMark: on watermarking of visual foundation models via massive activations |
Proposes ActiveMark to address watermark protection for visual foundation models |
foundation model |
|
|
| 3 |
A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification |
Proposes S²Fin, a spatial-spectral-frequency interactive network, to improve multimodal remote sensing image classification accuracy. |
multimodal |
✅ |
|
| 4 |
Factuality Matters: When Image Generation and Editing Meet Structured Visuals |
Addresses factuality in structured visual generation and editing by proposing the StructBench benchmark and a multimodal fusion model. |
multimodal chain-of-thought |
|
|
| 5 |
MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models |
MedCLM: learns localization and reasoning in medical vision-language models via a CoT curriculum |
visual grounding chain-of-thought |
|
|
| 6 |
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation |
VChain: a chain of visual thought for reasoning in video generation |
multimodal |
|
|
| 7 |
Character Mixing for Video Generation |
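Proposes the CCE and CCA frameworks for video generation that mixes characters across different fictional worlds, addressing style degradation. |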
multimodal |
✅ |
|
| 8 |
Visual Representations inside the Language Model |
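Analyzes the internal visual representations of multimodal large language models, revealing bottlenecks in their perception and directions for improvement |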
multimodal |
|
|
| 9 |
Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics |
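Proposes a Transformer-based method for person identification from conversational dynamics, improving identification accuracy in natural interaction settings. |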
multimodal |
|
|
| 10 |
ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion |
Proposes a blendshape-guided diffusion model for identity-preserving, precise expression generation. |
foundation model |
✅ |
|
| 11 |
VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery |
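Proposes the VaseVQA-3D dataset and the VaseVLM model to address data scarcity and insufficient domain knowledge in visual question answering for 3D cultural artifacts. |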
multimodal |
✅ |
|
| 12 |
Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting |
VLMCountBench reveals significant shortcomings of vision-language models on compositional counting tasks |
embodied AI |
|
|