| 1 |
Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory |
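Proposes the Multimodal Item Response Theory (M3IRT) framework for evaluating the cross-modal reasoning ability of multimodal large language models. |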
large language model multimodal |
|
|
| 2 |
Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches |
Proposes pause-aware decoding methods based on multimodal LLMs for real-time generation of game video commentary. |
large language model multimodal |
|
|
| 3 |
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities |
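Proposes SteerEval for evaluating the controllability of large language models at multiple granularities, revealing the shortcomings of existing methods in fine-grained control. |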
large language model |
|
|
| 4 |
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health |
TrustMH-Bench: a comprehensive benchmark for evaluating the trustworthiness of large language models in the mental health domain. |
large language model |
|
|
| 5 |
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models |
TAO-Attack: an advanced optimization-based jailbreak attack method for large language models. |
large language model |
|
|
| 6 |
A Browser-based Open Source Assistant for Multimodal Content Verification |
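Proposes a browser-based, open-source multimodal content verification assistant that helps journalists and fact-checkers quickly verify digital media information. |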
multimodal |
|
|
| 7 |
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets |
Uses large-scale datasets to re-evaluate whether OCR is necessary for document information extraction in the MLLM era. |
large language model multimodal |
|
|
| 8 |
From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench |
KMP-Bench: a comprehensive benchmark for evaluating the pedagogical intelligence of LLMs in K-8 mathematics teaching. |
large language model |
|
|
| 9 |
GPUTOK: GPU Accelerated Byte Level BPE Tokenization |
GPUTOK: GPU-accelerated byte-level BPE tokenization that improves long-text processing efficiency. |
large language model |
|
|
| 10 |
APRES: An Agentic Paper Revision and Evaluation System |
APRES: an LLM-based paper revision and evaluation system that improves paper quality and impact. |
large language model |
|
|
| 11 |
Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection |
Proposes a compact prompting method based on instruction-tuned LLMs for joint argumentative component detection. |
large language model |
|
|
| 12 |
Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs |
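Knowledge tracing models outperform large language models on educational prediction tasks, being faster, cheaper, and more accurate. |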
large language model |
|
|
| 13 |
Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration |
Proposes DiSE, which enables efficient self-evaluation of diffusion language models through sequence regeneration. |
large language model |
|
|
| 14 |
ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs |
Proposes a reasoning method based on normalization and deterministic parsing that improves LLM performance on formal reasoning tasks. |
large language model |
|
|
| 15 |
Eval4Sim: An Evaluation Framework for Persona Simulation |
Proposes the Eval4Sim framework to address the lack of adequate evaluation for dialogue simulation. |
large language model |
|
|
| 16 |
LaTeX Compilation: Challenges in the Era of LLMs |
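Addresses the limitations of TeX in the LLM era by proposing the Mogan STEM editor to improve compilation efficiency and LLM fine-tuning performance. |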
large language model |
|
|
| 17 |
Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models |
Proposes cross-family speculative prefill, which uses small draft models for training-free long-context compression. |
large language model |
|
|
| 18 |
Think, But Don't Overthink: Reproducing Recursive Language Models |
Reproduces recursive language models: overly deep recursion causes the model to "overthink". |
large language model |
✅ |
|