cs.CV (2026-03-04)

📊 40 papers total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (13 🔗3) Pillar 9: Embodied Foundation Models (12 🔗4) Pillar 3: Perception & Semantics (8 🔗3) Pillar 1: Robot Control (3) Pillar 7: Motion Retargeting (2) Pillar 4: Generative Motion (2 🔗1)

🔬 Pillar 2: RL & Architecture (13 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR	EgoPoseFormer v2: accurate egocentric human motion estimation for AR/VR.	teacher-student distillation egocentric
2	PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation	PROSPECT: unified streaming vision-language navigation via semantic-spatial fusion and latent predictive representation.	predictive model representation learning vision-language-action
3	Scaling Dense Event-Stream Pretraining from Visual Foundation Models	Proposes an event-stream pretraining method built on visual foundation models that addresses semantic collapse in event representations.	distillation foundation model
4	Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection	Proposes CMDR-IAD for multimodal industrial anomaly detection.	teacher-student multimodal	✅
5	From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning	Proposes the AVAR framework, which addresses attention allocation during the cold-start stage of multimodal large models and significantly improves reasoning performance.	reward shaping multimodal	✅
6	CoRe-BT: A Multimodal Radiology-Pathology-Text Benchmark for Robust Brain Tumor Typing	CoRe-BT: a multimodal radiology-pathology-text benchmark dataset for robust brain tumor typing.	representation learning multimodal
7	Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning	Proposes a dual-criterion curation method based on gaze stability and pupil novelty for efficient egocentric learning.	imitation learning egocentric
8	Discriminative Perception via Anchored Description for Reasoning Segmentation	Proposes DPAD, which achieves discriminative perception through anchored descriptions and improves reasoning segmentation performance.	reinforcement learning large language model multimodal	✅
9	Separators in Enhancing Autoregressive Pretraining for Vision Mamba	Proposes STAR, which uses separators to enhance autoregressive pretraining of Vision Mamba and improves long-sequence handling.	Mamba state space model
10	TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning	TaxonRL: reinforcement learning with intermediate rewards for interpretable fine-grained visual reasoning.	reinforcement learning
11	DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers	DiverseDiT: improves image synthesis quality through diverse representation learning in diffusion Transformers.	representation learning
12	UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization	UniRain: a unified image deraining framework with RAG-based dataset distillation and multi-objective reweighted optimization.	distillation
13	Vector-Quantized Soft Label Compression for Dataset Distillation	Proposes a soft-label compression method based on a vector-quantized autoencoder to speed up dataset distillation and reduce storage overhead.	distillation

🔬 Pillar 9: Embodied Foundation Models (12 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
14	Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe	Proposes a one-shot probe for predicting the zero-shot accuracy of VLFMs on underrepresented domains.	large language model foundation model	✅
15	Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions	Proposes an image-based prompt injection attack that hijacks multimodal LLMs with visually embedded adversarial instructions.	large language model multimodal
16	PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters	Proposes PlaneCycle, which transfers 2D pretrained models to 3D tasks without training or adapters.	foundation model	✅
17	Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers	For small-patch pathology images, task-specific CNNs outperform pretrained models, and training data scale is the key factor.	foundation model
18	ProFound: A moderate-sized vision foundation model for multi-task prostate imaging	ProFound: a moderate-sized vision foundation model for multi-task prostate imaging.	foundation model
19	Towards Generalized Multimodal Homography Estimation	Proposes a generalized multimodal homography estimation method that improves cross-modal generalization.	multimodal
20	Universal Pansharpening Foundation Model	Proposes FoundPS, a universal pansharpening foundation model for satellite-agnostic, scene-robust image fusion.	foundation model
21	RIVER: A Real-Time Interaction Benchmark for Video LLMs	Proposes the RIVER benchmark for evaluating video LLMs in real-time interaction scenarios.	large language model multimodal	✅
22	EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs	EvoPrune: early-stage visual token pruning for efficient multimodal LLMs.	large language model multimodal
23	Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection	Pointer-CAD: unifies B-Rep and command sequences through pointer-based edge/face selection, improving CAD generation quality.	large language model
24	SSR: A Generic Framework for Text-Aided Map Compression for Localization	Proposes the SSR framework, which uses text-aided map compression to improve localization efficiency and reduce storage cost.	large language model
25	RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation	Proposes RAGTrack, which uses a retrieval-augmented generation framework to address target modeling and modality fusion in RGBT tracking.	large language model	✅

🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
26	EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding	Proposes EmbodiedSplat, a feed-forward semantic 3DGS method for online open-vocabulary 3D scene understanding.	3DGS scene understanding open-vocabulary	✅
27	DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping	DISC: a dense integrated semantic context method for large-scale open-set semantic mapping.	semantic mapping semantic map	✅
28	Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation	Crab$^{+}$: a scalable, unified audio-visual scene understanding model with explicit cooperation.	scene understanding large language model multimodal
29	Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning	Proposes a structure-aware prompt adaptation method that improves generalization in open-vocabulary compositional zero-shot learning.	open-vocabulary open vocabulary
30	Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding	Proposes a knowledge-augmented fine-grained reasoning agent (KFRA) for open-set fine-grained visual understanding.	open-vocabulary open vocabulary multimodal
31	Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements	Yolo-Key-6D: single-stage monocular 6D pose estimation with keypoint enhancements.	6D pose estimation
32	Glass Segmentation with Fusion of Learned and General Visual Features	Proposes a glass segmentation network that fuses learned features with general visual features, improving recognition accuracy for transparent objects.	scene understanding foundation model	✅
33	ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training	ZipMap: linear-time, stateful 3D reconstruction with test-time training.	VGGT

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
34	ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors	ArtHOI: synthesizes articulated human-object interactions through 4D reconstruction from video priors.	manipulation optical flow physically plausible
35	Motion Manipulation via Unsupervised Keypoint Positioning in Face Animation	Proposes MMFA, which enables controllable face animation through unsupervised keypoint positioning.	manipulation representation learning
36	InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models	Proposes InEdit-Bench for evaluating how well image editing models reason along intermediate logical pathways.	manipulation multimodal

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
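37	SimpliHuMoN: Simplifying Human Motion Prediction	SimpliHuMoN: proposes a simplified Transformer model for human motion prediction that achieves SOTA on multiple tasks.	human motion human motion prediction motion prediction
38	InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions	InfinityStory proposes a background-consistent, character-aware long video generation framework that enables hour-scale narrative video synthesis.	spatial relationship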

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
39	NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction	NOVA3R: a visual Transformer for non-pixel-aligned amodal 3D reconstruction.	physically plausible
40	Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance	Proposes a diffusion sampling method based on embedded Runge-Kutta guidance that uses solver error to improve image generation quality.	classifier-free guidance	✅
