V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Fudan University, Adobe Research, Stanford University

Abstract

Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editing of these intrinsic representations remains unexplored.

We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality.

Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.

Video

Method Overview

V-RGBX Method Overview

Model architecture. V-RGBX consists of three parts: (1) Inverse Renderer D, which decomposes the input video into albedo, normal, material, and irradiance channels; (2) Intrinsic Conditioning Sampler S, which interleaves edited keyframe intrinsics with non-conflicting, randomly sampled intrinsic frames to form a unified intrinsic conditioning video; and (3) Forward Renderer R, which integrates the intrinsic video, keyframe reference, and temporally aware intrinsic embeddings to synthesize the output RGB video and consistently propagate intrinsic properties across time.
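The three-stage structure above can be sketched as a simple pipeline. This is a minimal illustrative stub, not the authors' implementation: the function names (`inverse_render`, `sample_conditioning`, `forward_render`) and the placeholder network bodies are our assumptions; only the data flow (D → S → R) follows the description.

```python
import numpy as np

# Intrinsic channels named in the paper.
CHANNELS = ("albedo", "normal", "material", "irradiance")

def inverse_render(video):
    """Inverse Renderer D (stub): decompose each RGB frame into
    per-channel intrinsic videos. A real model would run a network here."""
    return {c: video.copy() for c in CHANNELS}

def sample_conditioning(intrinsics, edited_keyframes, keyframe_ids):
    """Intrinsic Conditioning Sampler S (stub): overwrite the intrinsic
    frames at the user-selected keyframe indices with the edited keyframe
    intrinsics, leaving the remaining (non-conflicting) frames as sampled
    conditioning."""
    cond = {c: v.copy() for c, v in intrinsics.items()}
    for t, kf in zip(keyframe_ids, edited_keyframes):
        for c in CHANNELS:
            cond[c][t] = kf[c]
    return cond

def forward_render(cond, keyframe_ref):
    """Forward Renderer R (stub): synthesize an RGB video from the
    intrinsic conditioning video; here simply the channel mean."""
    return np.mean([cond[c] for c in CHANNELS], axis=0)

# Toy run: 8 frames of 4x4 RGB, with an edited keyframe at t=0.
video = np.random.rand(8, 4, 4, 3)
intr = inverse_render(video)
edited = [{c: np.zeros((4, 4, 3)) for c in CHANNELS}]
cond = sample_conditioning(intr, edited, keyframe_ids=[0])
out = forward_render(cond, keyframe_ref=video[0])
print(out.shape)  # (8, 4, 4, 3)
```

The sketch only shows how an edit at one keyframe enters the conditioning video that the forward renderer consumes; the temporal propagation itself is learned by R in the actual system.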

Editing Workflow

V-RGBX Editing Workflow

Intrinsic-aware video editing workflow of V-RGBX. Given an input video and edited keyframes, we decompose both into intrinsic channels, and the intrinsic conditioning sampler combines these representations into an intrinsic conditioning video. The forward renderer then synthesizes the final edited sequence using both the intrinsic video and the appearance cues provided by the edited keyframes.

Visual Results

Intrinsic-aware Editing

V-RGBX converts solid color, texture, material, and normal edits on a keyframe into intrinsic-space modifications. These intrinsic cues enable reliable propagation across the whole sequence. The final video presents consistent and realistic edits in all frames.

Video Relighting

V-RGBX maps light-color and shadow edits to intrinsic irradiance changes. These intrinsic cues drive consistent propagation across the whole sequence. The final video shows stable and realistic relighting effects in all frames.

Multi-attribute Editing

V-RGBX combines multiple intrinsic edits, such as albedo and irradiance modifications, within a keyframe. These joint intrinsic cues guide consistent propagation across the sequence. The final video shows coherent and realistic multi-attribute edits in all frames.

In-the-wild Generalization

V-RGBX handles challenging in-the-wild scenes by converting keyframe edits—even with complex textures and lighting—into intrinsic modifications. These intrinsic cues enable stable propagation despite real-world appearance variations.

Video RGB ↔ X Transition

RGB2X Inverse Rendering

V-RGBX decomposes real-world videos into intrinsic representations including albedo, normal, irradiance, and material. This RGB2X transition provides a structured scene understanding that supports reliable, controllable video editing.

X2RGB Forward Rendering

V-RGBX reconstructs realistic RGB videos from intrinsic inputs such as albedo, normal, irradiance, and material. This forward rendering preserves scene lighting, geometry, and material appearance. We compare against existing methods visually.

Video RGB ↔ X Cycle Consistency

V-RGBX performs a full RGB→X→RGB cycle by first decomposing the input video into intrinsic representations and then reconstructing it through the forward renderer. This cycle evaluates whether intrinsic decomposition and synthesis remain consistent with the original video. We compare against existing methods visually.
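One common way to quantify such a cycle is frame-averaged PSNR between the input video and its RGB→X→RGB reconstruction. The metric choice here is our assumption for illustration; the paper's own quantitative protocol may differ.

```python
import numpy as np

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio between two arrays in [0, data_range]."""
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical videos: perfect cycle
    return 10.0 * np.log10(data_range ** 2 / mse)

# Toy check: a small perturbation stands in for reconstruction error
# introduced by the RGB -> X -> RGB round trip.
orig = np.random.rand(8, 16, 16, 3)
recon = np.clip(orig + 0.01, 0.0, 1.0)
print(psnr(orig, recon))
```

Higher PSNR indicates that the intrinsic decomposition and the forward renderer together preserve the original video more faithfully.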

BibTeX

@misc{fang2025vrgbxvideoeditingaccurate,
      title={V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties}, 
      author={Ye Fang and Tong Wu and Valentin Deschaintre and Duygu Ceylan and Iliyan Georgiev and Chun-Hao Paul Huang and Yiwei Hu and Xuelin Chen and Tuanfeng Yang Wang},
      year={2025},
      eprint={2512.11799},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.11799}, 
}