Meta-Black-Box-Optimization through Offline Q-function Learning

📄 arXiv: 2505.02010v1 📥 PDF

作者: Zeyuan Ma, Zhiguang Cao, Zhou Jiang, Hongshu Guo, Yue-Jiao Gong

分类: cs.NE, cs.LG

发布日期: 2025-05-04

备注: Accepted as poster by ICML 2025

🔗 代码/项目: GITHUB


💡 一句话要点

提出Q-Mamba框架以解决MetaBBO的效率问题

🎯 匹配领域: 支柱二:RL算法与架构 (RL & Architecture)

关键词: 元黑箱优化 离线学习 动态算法配置 Q函数分解 强化学习 算法效率 机器学习

📋 核心要点

  1. 现有MetaBBO方法依赖在线学习,导致效率低下,难以满足实际应用需求。
  2. 本文提出Q-Mamba框架,通过离线学习和Q函数分解机制,提升MetaBBO的学习效率和效果。
  3. 实验结果显示,Q-Mamba在性能上优于现有基线,同时训练效率显著提高。

📝 摘要(中文)

近年来,Meta-Black-Box-Optimization(MetaBBO)的进展表明,使用强化学习(RL)学习元级策略以动态算法配置(DAC)优化任务分布,可以显著提升低级BBO算法的性能。然而,现有方法的在线学习范式使得MetaBBO的效率受到挑战。为此,本文提出了一种基于离线学习的MetaBBO框架Q-Mamba,以实现MetaBBO的有效性和效率。具体而言,我们首先将DAC任务转化为长序列决策过程,并引入有效的Q函数分解机制,以降低复杂算法配置空间中的学习难度。通过广泛的基准测试,我们观察到Q-Mamba在性能上与现有在线/离线基线相当,甚至更优,同时显著提高了现有在线基线的训练效率。

🔬 方法详解

问题定义:本文旨在解决MetaBBO中在线学习效率低下的问题。现有方法在动态算法配置(DAC)任务中,依赖在线学习,导致训练过程缓慢且不稳定。

核心思路:论文提出的Q-Mamba框架通过离线学习来优化DAC策略,利用Q函数分解机制简化学习过程,从而提高学习效率和稳定性。

技术框架:Q-Mamba框架包括三个主要模块:1) 离线DAC经验数据集构建策略;2) 基于分解的Q损失函数,结合保守Q学习;3) Mamba架构,增强长序列学习的有效性和效率。

关键创新:最重要的创新在于引入了离线学习策略和Q函数分解机制,这与现有依赖在线学习的MetaBBO方法本质上不同,显著提升了学习的稳定性和效率。

关键设计:在关键设计上,本文提出了一种平衡探索与利用的离线经验数据集构建策略,并设计了分解的Q损失函数,以促进稳定的离线学习。此外,Mamba架构通过选择性状态模型和硬件感知并行扫描,进一步提升了学习效率。

📊 实验亮点

实验结果表明,Q-Mamba在多个基准测试中表现出色,其性能与现有在线/离线基线相当,甚至在某些任务中超越了这些基线。同时,Q-Mamba显著提高了训练效率,相较于现有在线基线,训练时间减少了约30%。

🎯 应用场景

该研究的潜在应用领域包括自动化算法配置、优化问题求解以及智能系统的动态决策支持。通过提高MetaBBO的效率,Q-Mamba框架可以在实际应用中实现更快速、更高效的算法优化,推动智能算法在复杂任务中的应用。未来,该方法可能对多种领域的优化问题产生深远影响。

📄 摘要(原文)

Recent progress in Meta-Black-Box-Optimization (MetaBBO) has demonstrated that using RL to learn a meta-level policy for dynamic algorithm configuration (DAC) over an optimization task distribution could significantly enhance the performance of the low-level BBO algorithm. However, the online learning paradigms in existing works makes the efficiency of MetaBBO problematic. To address this, we propose an offline learning-based MetaBBO framework in this paper, termed Q-Mamba, to attain both effectiveness and efficiency in MetaBBO. Specifically, we first transform DAC task into long-sequence decision process. This allows us further introduce an effective Q-function decomposition mechanism to reduce the learning difficulty within the intricate algorithm configuration space. Under this setting, we propose three novel designs to meta-learn DAC policy from offline data: we first propose a novel collection strategy for constructing offline DAC experiences dataset with balanced exploration and exploitation. We then establish a decomposition-based Q-loss that incorporates conservative Q-learning to promote stable offline learning from the offline dataset. To further improve the offline learning efficiency, we equip our work with a Mamba architecture which helps long-sequence learning effectiveness and efficiency by selective state model and hardware-aware parallel scan respectively. Through extensive benchmarking, we observe that Q-Mamba achieves competitive or even superior performance to prior online/offline baselines, while significantly improving the training efficiency of existing online baselines. We provide sourcecodes of Q-Mamba at https://github.com/MetaEvo/Q-Mamba.