| 1 |
Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning |
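Proposes an explainable action form assessment method and dataset based on multimodal CoT reasoning, addressing the problem of standardized action evaluation. |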
multimodal chain-of-thought |
✅ |
|
| 2 |
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning |
Skyra: detects AI-generated videos through grounded artifact reasoning. |
large language model multimodal |
|
|
| 3 |
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models |
Proposes the GRAN-TED framework for generating robust, aligned, and nuanced text embeddings for diffusion models. |
large language model multimodal |
|
|
| 4 |
EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration |
EmoCaliber: improves the reliability of visual emotion comprehension through confidence verbalization and calibration. |
large language model multimodal |
✅ |
|
| 5 |
Step-GUI Technical Report |
Proposes Step-GUI, which achieves efficient, secure, and general-purpose GUI automation through self-evolving training and the GUI-MCP protocol. |
large language model multimodal |
|
|
| 6 |
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models |
DiffusionVL: converts any autoregressive model into a diffusion vision-language model, improving performance and inference speed. |
multimodal |
✅ |
|
| 7 |
Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics |
Proposes TIMAR for causal turn-level modeling of the dynamics of interactive 3D conversational heads. |
multimodal |
✅ |
|
| 8 |
Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models |
Compares the performance of specialized counting architectures and vision-language models on visual enumeration tasks. |
multimodal |
|
|
| 9 |
Uni-Parser Technical Report |
Uni-Parser: a high-throughput document parsing engine for scientific literature and patents. |
large language model |
|
|
| 10 |
PMMD: A pose-guided multi-view multi-modal diffusion for person generation |
Proposes the PMMD framework, which achieves high-quality pose-guided person generation with a multi-view, multimodal diffusion model. |
multimodal |
✅ |
|