RelightVid: Temporal-Consistent Diffusion Model for Video Relighting



Ye Fang1,2*, Zeyi Sun1,3*, Shangzhan Zhang4, Tong Wu5, Yinghao Xu5, Pan Zhang1, Jiaqi Wang1, Gordon Wetzstein5, Dahua Lin1,6

1Shanghai AI Laboratory     2Fudan University     3Shanghai Jiao Tong University      4Zhejiang University     
5Stanford University      6The Chinese University of Hong Kong     

* Equal Contribution


Text-Conditioned Video Relighting

RelightVid relights the same input video under different text-based illumination prompts, offering fine-grained and consistent scene editing.

🔆 Full-Scene Relighting: Foreground and background are jointly relit to match the text prompt.

🎭 Foreground-Preserved Relighting: Background is inpainted while only the foreground is relit based on the prompt.




Background-Conditioned Video Relighting

RelightVid performs background-conditioned relighting, in which the background video dynamically drives the illumination of the foreground, producing coherent and context-aware results.

🎞️ Background Inputs: The first row shows different background videos serving as lighting conditions.

💡 Foreground Relighting: The following rows show the same foreground adaptively relit under each background condition.


BG Video A

BG Video B

BG Video C





HDR-Conditioned Video Relighting

Videos generated by RelightVid maintain consistent object illumination over time by leveraging dynamic HDR-conditioned lighting. The first row shows temporal HDR environment maps serving as lighting conditions, while the second row presents the relit results under each corresponding HDR map.



Dataset Pipeline

LightAtlas data pipeline overview. Our custom augmentation pipeline generates high-quality video relighting pairs from both in-the-wild videos and 3D-rendered data. The left side shows five types of augmented data extracted from in-the-wild source videos; the right side shows 3D-rendered video pairs generated under varying HDR maps and random camera trajectories, enhancing both the diversity and realism of the relighting data.
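The exact pairing logic of the LightAtlas pipeline is not spelled out here; as a rough illustration only, the sketch below shows one way 3D-rendered clips of the same scene and camera trajectory, rendered under different HDR maps, could be grouped into supervised relighting pairs. All names (`RenderedClip`, `make_relight_pairs`) are hypothetical and not part of the released code.

```python
import itertools
from dataclasses import dataclass

@dataclass
class RenderedClip:
    """One rendered video clip: a scene/trajectory rendered under one HDR map."""
    scene_id: str    # identifies the scene and camera trajectory
    hdr_name: str    # environment map used for this render
    frames_path: str # where the rendered frames live

def make_relight_pairs(clips):
    """Pair clips that share a scene but differ in illumination.

    Each ordered (source, target) pair supervises relighting: identical
    geometry and camera motion, different environment lighting.
    """
    by_scene = {}
    for clip in clips:
        by_scene.setdefault(clip.scene_id, []).append(clip)
    pairs = []
    for scene_clips in by_scene.values():
        # ordered permutations: each clip serves as both source and target
        for src, tgt in itertools.permutations(scene_clips, 2):
            pairs.append((src, tgt))
    return pairs
```

A scene rendered under N HDR maps yields N·(N−1) ordered training pairs, so even a modest set of environment maps multiplies the supervision available per scene.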


Model Architecture

RelightVid enhances video relighting by strategically inserting trainable temporal layers into a pretrained image illumination-editing diffusion framework (IC-Light), while integrating background videos, text prompts, and HDR maps through concatenation and illumination cross attention, enabling flexible and temporally consistent video relighting.
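To make the "trainable temporal layers" idea concrete, here is a minimal PyTorch sketch of one such layer: self-attention across the frame axis, applied per spatial location, with a zero-initialized output projection so that at the start of training the network reproduces the frozen image model exactly. This is a generic illustration of the technique, not RelightVid's released implementation; the class name and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Trainable temporal self-attention inserted after a frozen spatial block.

    Attends across frames independently at each spatial location. The output
    projection is zero-initialized, so the module is an identity at init and
    the pretrained image model's behavior is preserved before fine-tuning.
    """
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.proj.weight)  # identity residual branch at init
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch; frames become the sequence axis.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        attended, _ = self.attn(self.norm(tokens), self.norm(tokens),
                                self.norm(tokens))
        tokens = tokens + self.proj(attended)  # residual connection
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

# Hypothetical usage: interleave with frozen spatial blocks of the image model,
# training only the temporal layers:
#   features = temporal_attn(frozen_spatial_block(frames))
```

Zero-initializing the residual branch is a common trick when grafting new layers onto a pretrained diffusion model: training starts from the image model's known-good outputs and only gradually learns cross-frame dependencies.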


Citation
  @misc{fang2025relightvidtemporalconsistentdiffusionmodel,
    title={RelightVid: Temporal-Consistent Diffusion Model for Video Relighting}, 
    author={Ye Fang and Zeyi Sun and Shangzhan Zhang and Tong Wu and Yinghao Xu and Pan Zhang and Jiaqi Wang and Gordon Wetzstein and Dahua Lin},
    year={2025},
    eprint={2501.16330},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2501.16330}, 
}