cs.CV(2025-12-18)

📊 37 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (12 🔗5)
Pillar 9: Embodied Foundation Models (10 🔗3)
Pillar 3: Perception & Semantics (6 🔗1)
Pillar 1: Robot Control (4)
Pillar 4: Generative Motion (2)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 2: RL & Architecture (12 papers)

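# | Title | One-sentence summary | Tags
1 | KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals | KineST: a kinematics-guided spatiotemporal state space model for human motion tracking from sparse signals | state space model, representation learning, spatiotemporal
2 | BrepLLM: Native Boundary Representation Understanding with Large Language Models | BrepLLM: a large language model framework for native boundary representation understanding | contrastive learning, semantic mapping, semantic map
3 | SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning | SNOW: spatio-temporal scene understanding with world knowledge for open-world embodied reasoning | world model, scene understanding, multimodal
4 | AdaTooler-V: Adaptive Tool-Use for Images and Videos | Proposes AdaTooler-V, which uses adaptive tool use to improve the reasoning efficiency and performance of multimodal large language models on image and video tasks | reinforcement learning, large language model, multimodal
5 | Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation | Proposes a fast, highly expressive Gaussian head avatar method based on 3D-aware expression distillation | distillation, gaussian splatting, splatting
6 | SARMAE: Masked Autoencoder for SAR Representation Learning | SARMAE: a noise-aware masked autoencoder for SAR image representation learning | representation learning, masked autoencoder
7 | The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text | WorldCanvas: combines text, trajectories, and reference images for controllable simulation of world events | world model, multimodal, visual grounding
8 | Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation | Proposes the TODSynth framework for data synthesis and control optimization in remote sensing semantic segmentation | flow matching, foundation model, multimodal
9 | MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval | Proposes MACL to address multi-label semantic overlap and class imbalance in remote sensing image retrieval | representation learning, contrastive learning
10 | Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization | Proposes an action localization method based on skeleton-snippet contrastive learning and multiscale feature fusion | contrastive learning
11 | MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning | Proposes MomaGraph, which uses a vision-language model to build state-aware unified scene graphs for embodied task planning | reinforcement learning, scene understanding
12 | TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times | TurboDiffusion: speeds up video diffusion models by 100-200x through multiple acceleration strategies | linear attention, distillation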

🔬 Pillar 9: Embodied Foundation Models (10 papers)

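# | Title | One-sentence summary | Tags
13 | Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation | Causal-Tune: mines causal factors from vision foundation models for domain-generalized semantic segmentation | foundation model
14 | Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors | Proposes the EMER dataset and the EMERT model, using eye behaviors to bridge the gap between facial expression recognition and emotion recognition | multimodal
15 | Kling-Omni Technical Report | Kling-Omni: a general-purpose generative framework for end-to-end synthesis of high-quality video from multimodal inputs | multimodal, instruction following
16 | Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs | Proposes Sketch-in-Latents (SkiLa) for unified multimodal reasoning and visual imagination in MLLMs | large language model, multimodal
17 | PixelArena: A benchmark for Pixel-Precision Visual Intelligence | PixelArena: a benchmark for pixel-level visual intelligence that evaluates the image generation capabilities of multimodal large models | large language model, multimodal
18 | VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization | VIVA: an instruction-based video editing framework using VLM guidance and reward optimization | instruction following
19 | REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion | REGLUE: enhances latent diffusion models with global and local semantics to improve image synthesis quality | foundation model
20 | VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks | Proposes VenusBench-GD, a comprehensive multi-platform GUI benchmark for evaluating diverse grounding tasks | multimodal
21 | Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation | Avatar4D: synthesizes domain-specific 4D human data for real-world pose estimation | zero-shot transfer
22 | Machine Learning Enabled Graph Analysis of Particulate Composites: Application to Solid-state Battery Cathodes | Proposes a machine-learning-based graph analysis method for microstructure characterization and performance prediction of solid-state battery cathode materials | multimodal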

🔬 Pillar 3: Perception & Semantics (6 papers)

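# | Title | One-sentence summary | Tags
23 | Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation | Proposes DAP, a foundation model for panoramic depth estimation that improves generalization across scene distances | depth estimation, metric depth, geometric consistency
24 | SDFoam: Signed-Distance Foam for explicit surface reconstruction | SDFoam: combines an explicit Voronoi diagram with an implicit SDF for accurate surface reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting
25 | N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models | N3D-VLM: native 3D perception enables accurate spatial reasoning in vision-language models | depth estimation, spatial relationship, multimodal
26 | 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction | Proposes 4D Primitive-Mâché, which glues primitives together for persistent 4D scene reconstruction | scene reconstruction
27 | Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture | Uses Gaussian splatting to reconstruct high-fidelity facial geometry and texture, enabling controllable face generation | gaussian splatting, splatting, NeRF
28 | Auto-Vocabulary 3D Object Detection | Proposes AV3DOD for auto-vocabulary 3D object detection without user intervention | open-vocabulary, open vocabulary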

🔬 Pillar 1: Robot Control (4 papers)

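# | Title | One-sentence summary | Tags
29 | GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation | GeoPredict: uses predictive kinematics and 3D Gaussian geometry for precise VLA manipulation | manipulation, vision-language-action, VLA
30 | Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation | Proposes Make-It-Poseable to address the difficulty of pose control in 3D humanoid character animation | humanoid, character animation
31 | OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction | OPENTOUCH: builds a dataset and benchmark for full-hand tactile interaction in real-world settings | manipulation, egocentric, multimodal
32 | TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering | Proposes TextEditBench for evaluating the reasoning involved in image text editing, beyond simple rendering quality | manipulation, multimodal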

🔬 Pillar 4: Generative Motion (2 papers)

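# | Title | One-sentence summary | Tags
33 | Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos | EgoMAN: learns 3D hand trajectory prediction from egocentric interaction videos, linking reasoning to motion | motion generation, egocentric
34 | Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach | Proposes a gaze-primed body motion synthesis method for simulating the natural behavior of picking up or placing objects | motion generation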

🔬 Pillar 8: Physics-based Animation (2 papers)

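# | Title | One-sentence summary | Tags
35 | EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation | Proposes EverybodyDance, which uses bipartite graph matching to solve identity correspondence in multi-character animation | character animation
36 | EasyV2V: A High-quality Instruction-based Video Editing Framework | EasyV2V: a high-quality instruction-based video editing framework with reported performance exceeding existing commercial systems | spatiotemporal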

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags
37 | SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation | Proposes SegGraph, which uses graphs of SAM segments for few-shot 3D part segmentation | spatial relationship, foundation model
