RelightVid: Temporal-Consistent Diffusion Model for Video Relighting



Ye Fang1,2*, Zeyi Sun1,3*, Shangzhan Zhang4, Tong Wu5, Yinghao Xu5, Pan Zhang1, Jiaqi Wang1, Gordon Wetzstein5, Dahua Lin1,6

1Shanghai AI Laboratory     2Fudan University     3Shanghai Jiao Tong University      4Zhejiang University     
5Stanford University      6The Chinese University of Hong Kong     

* Equal Contribution


Text-Conditioned Video Relighting

RelightVid relights the same input video under different text-based illumination prompts, offering fine-grained and consistent scene editing.

🔆 Full-Scene Relighting: Foreground and background are jointly relit to match the text prompt.

🎭 Foreground-Preserved Relighting: Background is inpainted while only the foreground is relit based on the prompt.




Background-Conditioned Video Relighting

RelightVid performs background-conditioned relighting, in which the background video dynamically drives the illumination of the foreground, producing coherent and context-aware results.

🎞️ Background Inputs: The first row shows different background videos serving as lighting conditions.

💡 Foreground Relighting: The following rows show the same foreground adaptively relit under each background condition.


BG Video A

BG Video B

BG Video C





HDR-Conditioned Video Relighting

Videos generated by RelightVid maintain consistent object illumination over time by leveraging dynamic HDR-conditioned lighting. The first row shows temporal HDR environment maps serving as lighting conditions, while the second row presents the relit results under each corresponding HDR map.



Dataset Pipeline

LightAtlas data pipeline overview. Our custom augmentation pipeline generates high-quality video relighting pairs from both in-the-wild videos and 3D-rendered data. The left side shows five types of augmented data extracted from in-the-wild source videos; the right side shows 3D-rendered video pairs generated under varying HDR maps and random camera trajectories, enhancing both the diversity and realism of the relighting data.
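The exact pairing logic of the LightAtlas pipeline is not spelled out here; as a rough illustration only, the sketch below shows one way 3D-rendered clips of the same scene and camera trajectory, rendered under different HDR maps, could be grouped into supervised relighting pairs. All names (`RenderedClip`, `make_relight_pairs`) are hypothetical and not part of the released code.

```python
import itertools
from dataclasses import dataclass

@dataclass
class RenderedClip:
    """One rendered video clip: a scene/trajectory rendered under one HDR map."""
    scene_id: str    # identifies the scene and camera trajectory
    hdr_name: str    # environment map used for this render
    frames_path: str # where the rendered frames live

def make_relight_pairs(clips):
    """Pair clips that share a scene but differ in illumination.

    Each ordered (source, target) pair supervises relighting: identical
    geometry and camera motion, different environment lighting.
    """
    by_scene = {}
    for clip in clips:
        by_scene.setdefault(clip.scene_id, []).append(clip)
    pairs = []
    for scene_clips in by_scene.values():
        # ordered permutations: each clip serves as both source and target
        for src, tgt in itertools.permutations(scene_clips, 2):
            pairs.append((src, tgt))
    return pairs
```

A scene rendered under N HDR maps yields N·(N−1) ordered training pairs, so even a modest set of environment maps multiplies the supervision available per scene.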


Model Architecture

RelightVid enhances video relighting by strategically inserting trainable temporal layers into a pretrained image illumination-editing diffusion framework (IC-Light), while integrating background videos, text prompts, and HDR maps through concatenation and illumination cross attention, enabling flexible and temporally consistent video relighting.
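To make the "trainable temporal layers" idea concrete, here is a minimal PyTorch sketch of one such layer: self-attention across the frame axis, applied per spatial location, with a zero-initialized output projection so that at the start of training the network reproduces the frozen image model exactly. This is a generic illustration of the technique, not RelightVid's released implementation; the class name and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Trainable temporal self-attention inserted after a frozen spatial block.

    Attends across frames independently at each spatial location. The output
    projection is zero-initialized, so the module is an identity at init and
    the pretrained image model's behavior is preserved before fine-tuning.
    """
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.proj.weight)  # identity residual branch at init
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch; frames become the sequence axis.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        attended, _ = self.attn(self.norm(tokens), self.norm(tokens),
                                self.norm(tokens))
        tokens = tokens + self.proj(attended)  # residual connection
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

# Hypothetical usage: interleave with frozen spatial blocks of the image model,
# training only the temporal layers:
#   features = temporal_attn(frozen_spatial_block(frames))
```

Zero-initializing the residual branch is a common trick when grafting new layers onto a pretrained diffusion model: training starts from the image model's known-good outputs and only gradually learns cross-frame dependencies.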


Citation
  @misc{fang2025relightvidtemporalconsistentdiffusionmodel,
    title={RelightVid: Temporal-Consistent Diffusion Model for Video Relighting}, 
    author={Ye Fang and Zeyi Sun and Shangzhan Zhang and Tong Wu and Yinghao Xu and Pan Zhang and Jiaqi Wang and Gordon Wetzstein and Dahua Lin},
    year={2025},
    eprint={2501.16330},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2501.16330}, 
}