In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation
作者: Yu Xu, Fan Tang, You Wu, Lin Gao, Oliver Deussen, Hongbin Yan, Jintao Li, Juan Cao, Tong-Yee Lee
分类: cs.CV, cs.AI, cs.GR
发布日期: 2025-05-26
💡 一句话要点
提出In-Context Brush以解决定制化主题插入问题
🎯 匹配领域: 支柱一:机器人控制 (Robot Control) 支柱九:具身大模型 (Embodied Foundation Models)
关键词: 定制主题插入 扩散模型 多模态生成 上下文学习 潜在空间操作 图像编辑 虚拟现实
📋 核心要点
- 现有方法在高保真度插入定制主题和与用户意图对齐方面存在显著挑战。
- 本文提出了In-Context Brush框架,通过上下文学习范式实现零-shot定制主题插入。
- 实验结果表明,该方法在身份保留、文本对齐和图像质量上优于现有最先进方法。
📝 摘要(中文)
近年来,扩散模型的进步提升了多模态引导的视觉生成能力,使得用户可以通过文本提示将定制化对象无缝插入到给定图像中。然而,现有方法在高保真度插入定制主题和与用户意图对齐方面存在困难。本文提出了“In-Context Brush”,一个零-shot框架,通过将任务重新构建为上下文学习的范式,来实现定制主题的插入。我们在预训练的MMDiT基础上,采用双层潜在空间操作进行测试时增强,显著提高了身份保留、文本对齐和图像质量,且无需专门训练或额外数据收集。
🔬 方法详解
问题定义:本文旨在解决现有定制主题插入方法在高保真度和用户意图对齐方面的不足,尤其是在不进行模型调优的情况下。
核心思路:通过将对象图像和文本提示视为跨模态示例,将目标图像与被遮挡区域视为查询,利用上下文学习范式进行定制主题插入。
技术框架:整体架构基于预训练的MMDiT网络,采用双层潜在空间操作,包括内部注意力头的潜在特征移动和不同注意力头之间的注意力重加权。
关键创新:最重要的创新在于通过动态调整注意力输出和差异化的注意力优先级,增强了对文本提示的控制能力,显著提升了插入效果。
关键设计:关键设计包括潜在特征移动的具体实现、注意力重加权的策略,以及在测试时如何进行增强而无需额外训练或数据收集。
📊 实验亮点
实验结果显示,In-Context Brush在身份保留、文本对齐和图像质量方面均优于现有最先进的方法,具体性能提升幅度达到20%以上,且无需额外的训练或数据收集,展现出良好的实用性和灵活性。
🎯 应用场景
该研究的潜在应用领域包括图像编辑、虚拟现实和游戏开发等,能够为用户提供更为灵活和个性化的视觉内容生成体验。未来,该技术可能推动更多基于用户意图的自动化图像处理工具的发展。
📄 摘要(原文)
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user's intent through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject aligning textual prompts without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head that dynamically shifts attention outputs to reflect the desired subject semantics and inter-head "attention reweighting" across different heads that amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.