cs.CV (2025-12-18)

📊 37 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (12 🔗5) · Pillar 9: Embodied Foundation Models (10 🔗3) · Pillar 3: Perception & Semantics (6 🔗1) · Pillar 1: Robot Control (4) · Pillar 4: Generative Motion (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 2: RL & Architecture (12 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals	KineST: a kinematics-guided spatiotemporal state space model for human motion tracking from sparse signals	state space model representation learning spatiotemporal	✅
2	BrepLLM: Native Boundary Representation Understanding with Large Language Models	BrepLLM: proposes a large language model framework for native boundary representation understanding	contrastive learning semantic mapping semantic map
3	SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning	SNOW: spatio-temporal scene understanding with world knowledge for open-world embodied reasoning	world model scene understanding multimodal
4	AdaTooler-V: Adaptive Tool-Use for Images and Videos	Proposes AdaTooler-V, which uses adaptive tool use to improve the reasoning efficiency and performance of multimodal large language models on image and video tasks	reinforcement learning large language model multimodal
5	Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation	Proposes a fast, highly expressive Gaussian head avatar method based on 3D-aware expression distillation	distillation gaussian splatting splatting
6	SARMAE: Masked Autoencoder for SAR Representation Learning	SARMAE: a noise-aware masked autoencoder for SAR image representation learning	representation learning masked autoencoder	✅
7	The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text	WorldCanvas: combines text, trajectories, and reference images for controllable world-event simulation	world model multimodal visual grounding	✅
8	Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation	Proposes the TODSynth framework for data synthesis and control optimization in remote sensing semantic segmentation	flow matching foundation model multimodal
9	MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval	Proposes MACL to address multi-label semantic overlap and class imbalance in remote sensing image retrieval	representation learning contrastive learning	✅
10	Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization	Proposes an action localization method based on skeleton-snippet contrastive learning and multiscale feature fusion	contrastive learning
11	MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning	Proposes MomaGraph, which uses a vision-language model to build state-aware unified scene graphs for embodied task planning	reinforcement learning scene understanding
12	TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times	TurboDiffusion: speeds up video diffusion models by 100-200x through multiple acceleration strategies	linear attention distillation	✅

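🔬 Pillar 9: Embodied Foundation Models (10 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
13	Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation	Causal-Tune: mines causal factors from vision foundation models for domain-generalized semantic segmentation	foundation model
14	Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors	Proposes the EMER dataset and the EMERT model, which use eye behaviors to bridge the gap between facial expression recognition and emotion recognition	multimodal	✅
15	Kling-Omni Technical Report	Kling-Omni: a general generative framework for end-to-end synthesis of high-quality video from multimodal inputs	multimodal instruction following
16	Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs	Proposes Sketch-in-Latents (SkiLa) for unified multimodal reasoning and visual imagination in MLLMs	large language model multimodal	✅
17	PixelArena: A benchmark for Pixel-Precision Visual Intelligence	PixelArena: proposes a pixel-level visual intelligence benchmark that evaluates the image generation capabilities of multimodal large models	large language model multimodal
18	VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization	VIVA: an instruction-based video editing framework with VLM guidance and reward optimization	instruction following
19	REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion	REGLUE: enhances latent diffusion models with global and local semantics to improve image synthesis quality	foundation model	✅
20	VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks	Proposes VenusBench-GD, a comprehensive multi-platform GUI benchmark for evaluating diverse grounding tasks	multimodal
21	Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation	Avatar4D: synthesizes domain-specific 4D human data for real-world pose estimation	zero-shot transfer
22	Machine Learning Enabled Graph Analysis of Particulate Composites: Application to Solid-state Battery Cathodes	Proposes a machine-learning-based graph analysis method for microstructure characterization and performance prediction of solid-state battery cathode materials	multimodal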

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
23	Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation	Proposes DAP, a foundation model for panoramic depth estimation that improves generalization across scene distances	depth estimation metric depth geometric consistency	✅
24	SDFoam: Signed-Distance Foam for explicit surface reconstruction	SDFoam: combines explicit Voronoi diagrams with an implicit SDF for accurate surface reconstruction	3D gaussian splatting 3DGS gaussian splatting
25	N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models	N3D-VLM: native 3D perception enables accurate spatial reasoning in vision-language models	depth estimation spatial relationship multimodal
26	4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction	Proposes 4D Primitive-Mâché, which glues primitives together for persistent 4D scene reconstruction	scene reconstruction
27	Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture	Uses Gaussian splatting to reconstruct high-fidelity facial geometry and texture, enabling controllable face generation	gaussian splatting splatting NeRF
28	Auto-Vocabulary 3D Object Detection	Proposes AV3DOD for auto-vocabulary 3D object detection without user intervention	open-vocabulary open vocabulary

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
29	GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation	GeoPredict: uses predictive kinematics and 3D Gaussian geometry for precise VLA manipulation	manipulation vision-language-action VLA
30	Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation	Proposes Make-It-Poseable to address the difficulty of pose control in 3D humanoid character animation	humanoid character animation
31	OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction	OPENTOUCH: builds a dataset and benchmark for full-hand tactile interaction in real-world settings	manipulation egocentric multimodal
32	TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering	Proposes TextEditBench for evaluating the reasoning involved in image text editing, beyond simple rendering quality	manipulation multimodal

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
33	Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos	EgoMAN: learns 3D hand trajectory prediction from egocentric interaction videos, bridging reasoning and motion	motion generation egocentric
34	Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach	Proposes a gaze-primed human motion synthesis method for simulating the natural behavior of grasping or placing objects	motion generation

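🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
35	EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation	Proposes EverybodyDance, which resolves identity correspondence in multi-character animation through bipartite graph matching	character animation
36	EasyV2V: A High-quality Instruction-based Video Editing Framework	EasyV2V: a high-quality instruction-based video editing framework that outperforms existing commercial systems	spatiotemporal	✅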

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
37	SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation	Proposes SegGraph, which uses graphs of SAM segments for few-shot 3D part segmentation	spatial relationship foundation model	✅
