RelightVid: Temporal-Consistent Diffusion Model for Video Relighting
* Equal Contribution
Text-Conditioned Video Relighting
RelightVid relights the same input video under different text-based illumination prompts, offering fine-grained and consistent scene editing.
🔆 Full-Scene Relighting: Foreground and background are jointly relit to match the text prompt.
🎭 Foreground-Preserved Relighting: Background is inpainted while only the foreground is relit based on the prompt.
Example illumination prompts: natural lighting · sunlight filtering through trees, dappled light · overhead softbox lighting, clean shadows · glowing lava, fiery illumination · harsh industrial lighting, gritty · soft morning light, golden hour
Example illumination prompts: on a rainy day · in the harsh, cold light of noon · at night with moonlight · a car driving on the street, neon light · a car driving on the beach, sunset over sea
Background-Conditioned Video Relighting
RelightVid performs background-conditioned relighting, in which the background video dynamically drives the illumination of the foreground to achieve coherent, context-aware results.
🎞️ Background Inputs: The first row shows different background videos serving as lighting conditions.
💡 Foreground Relighting: The following rows show the same foreground adaptively relit under each background condition.
HDR-Conditioned Video Relighting
Videos generated by RelightVid maintain consistent object illumination over time by leveraging dynamic HDR-conditioned lighting. The first row shows temporal HDR environment maps used as lighting conditions; the second row presents the relit results under each corresponding HDR map.
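To make the HDR-conditioning idea concrete, here is a minimal sketch of how per-frame HDR environment maps could be turned into conditioning tokens for a cross-attention layer. The `HDRCondEncoder` module, its pooling grid, and token dimension are all hypothetical illustrations, not the paper's actual module.

```python
import torch
import torch.nn as nn

class HDRCondEncoder(nn.Module):
    """Hypothetical encoder: per-frame HDR environment maps -> conditioning
    tokens (a sketch under assumed shapes, not RelightVid's exact module)."""
    def __init__(self, token_dim: int = 64, grid: int = 4):
        super().__init__()
        self.grid = grid
        self.pool = nn.AdaptiveAvgPool2d(grid)   # coarse directional summary
        self.proj = nn.Linear(3, token_dim)      # RGB cell -> token

    def forward(self, hdr: torch.Tensor) -> torch.Tensor:
        # hdr: (batch, frames, 3, H, W) linear-radiance environment maps
        b, t, c, h, w = hdr.shape
        x = torch.log1p(hdr.reshape(b * t, c, h, w))  # tame HDR dynamic range
        x = self.pool(x)                              # (B*T, 3, grid, grid)
        x = x.flatten(2).transpose(1, 2)              # (B*T, grid*grid, 3)
        tokens = self.proj(x)                         # (B*T, grid*grid, D)
        return tokens.reshape(b, t, self.grid * self.grid, -1)

maps = torch.rand(1, 8, 3, 32, 64)  # toy equirectangular HDR frames
enc = HDRCondEncoder()
tokens = enc(maps)
print(tokens.shape)  # torch.Size([1, 8, 16, 64])
```

Log-compressing the radiance before pooling is one common way to keep very bright light sources from dominating the averaged cells.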
Input Video/OBJ.
Dataset Pipeline
LightAtlas data pipeline overview. Our custom augmentation pipeline generates high-quality video relighting pairs from both in-the-wild videos and 3D-rendered data. The left side shows five types of augmented data extracted from in-the-wild source videos; the right side shows 3D-rendered video pairs generated under varying HDR maps and random camera trajectories, enhancing both the diversity and realism of the relighting data.
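A toy stand-in for one augmentation of this kind (the function name and parameters are illustrative, not the actual LightAtlas code): derive a (source, target) relighting pair from a single in-the-wild clip by applying one consistent exposure/gamma/color-tint change to every frame, so the pair differs only in illumination, never in content or timing.

```python
import torch

def make_relight_pair(clip: torch.Tensor,
                      gain: float = 1.4,
                      gamma: float = 0.8,
                      tint=(1.1, 1.0, 0.9)):
    """Hypothetical illumination augmentation (a sketch, not pipeline code).

    clip: (frames, 3, H, W) video with values in [0, 1].
    The same gain/gamma/tint is applied to every frame, keeping the
    illumination change temporally consistent across the clip.
    """
    tint_t = torch.tensor(tint).view(1, 3, 1, 1)        # per-channel tint
    target = (clip * tint_t).clamp(min=1e-6) ** gamma * gain
    return clip, target.clamp(0.0, 1.0)

src, tgt = make_relight_pair(torch.rand(8, 3, 16, 16))
print(src.shape == tgt.shape)  # True
```

Because the transform is identical on every frame, the only thing the model can learn from such a pair is the illumination edit itself.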
Model Architecture
RelightVid enhances video relighting by strategically inserting trainable temporal layers into a pretrained diffusion framework for image illumination editing (IC-Light), while effectively integrating background videos, text prompts, and HDR maps through concatenation and illumination cross-attention, enabling flexible and temporally consistent video relighting.
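The key structural move, inserting trainable temporal layers into a frozen image model, can be sketched as follows. This is a generic minimal example of the pattern, assuming a zero-initialized output projection so the pretrained per-frame behavior is preserved at initialization; it is not RelightVid's actual layer.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Trainable temporal self-attention to insert after a frozen spatial
    block (a generic sketch of the pattern, not RelightVid's exact layer)."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.proj.weight)  # zero-init: identity at start,
        nn.init.zeros_(self.proj.bias)    # so pretrained behavior is kept

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) latent video
        b, t, c, h, w = x.shape
        # fold spatial positions into the batch; attend across frames only
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y = self.norm(seq)
        y, _ = self.attn(y, y, y)
        seq = seq + self.proj(y)          # residual temporal update
        return seq.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

video = torch.randn(1, 8, 32, 16, 16)     # toy (B, T, C, H, W) latents
layer = TemporalAttention(channels=32)
out = layer(video)
print(out.shape)                          # torch.Size([1, 8, 32, 16, 16])
```

With the output projection zero-initialized, the layer is an exact identity before training, which is the standard trick for grafting new temporal capacity onto a pretrained image diffusion model without disturbing it.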
Citation
@misc{fang2025relightvidtemporalconsistentdiffusionmodel,
title={RelightVid: Temporal-Consistent Diffusion Model for Video Relighting},
author={Ye Fang and Zeyi Sun and Shangzhan Zhang and Tong Wu and Yinghao Xu and Pan Zhang and Jiaqi Wang and Gordon Wetzstein and Dahua Lin},
year={2025},
eprint={2501.16330},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.16330},
}