EditCtrl dynamically allocates compute in proportion to the edit mask size for video editing, achieving a considerable speedup over the full-attention baseline while maintaining or improving edit quality. Our method supports complex, prompt-guided edits on videos of arbitrary resolution and can handle multiple user-defined masks simultaneously. Denoising latency is measured on a single RTX 6000 Ada GPU.
High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck: they are typically designed to process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this work, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is guided by a lightweight temporal global context embedder that ensures video-wide consistency with minimal overhead. Not only is EditCtrl 10× more compute-efficient than state-of-the-art generative editing methods, it also improves editing quality compared to methods designed with full attention.
EditCtrl edits a source video given user-specified edit masks and a text prompt. Content inside the edit masks is removed, yielding a background-only video. Two complementary context signals are then produced: (1) local context, where the foreground edit region and its immediate surroundings are encoded at full resolution, capturing fine-grained spatial detail where new content will be generated; (2) global context, in which the background video is down-sampled to a fixed compact resolution and encoded, providing scene-wide appearance and motion cues regardless of the original video resolution. These signals are fed to trainable local and global adapters inside a pretrained text-to-video diffusion model that denoises tokens only in the masked edit region given the text prompt. After diffusion, the generated tokens are scattered back into the source video latent. Because the diffusion process is carried out only on masked foreground tokens, EditCtrl achieves a speedup proportional to the mask area.
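To make this flow concrete, the minimal sketch below mirrors the gather-denoise-scatter loop described above. It is illustrative only: `denoise_step`, the token shapes, and the step count are placeholders rather than the actual EditCtrl interface.

```python
import torch

def edit_masked_region(src_latent, token_mask, denoise_step, num_steps=30):
    """Denoise only the tokens inside the edit mask, then scatter them back.

    src_latent: (N, C) flattened video latent tokens.
    token_mask: (N,) bool, True for tokens inside the edit region.
    denoise_step: placeholder callable applying one conditioned denoising step.
    """
    # Noise is sampled only for the masked tokens, so cost scales with the mask size.
    x = torch.randn(int(token_mask.sum()), src_latent.shape[1])
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)  # local/global conditioning is assumed to happen inside
    out = src_latent.clone()
    out[token_mask] = x         # scatter the generated tokens back into the source latent
    return out
```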
Given the source video $\mathbf{V}_\text{src}$ and target edit masks $\mathbf{V}_m$, we extract the background content $\mathbf{V}_b$ and encode it with a video VAE encoder $\mathcal{E}$. This is then concatenated channel-wise with the down-sampled masks to give the control context $\mathbf{C}$. Tokens in $\mathbf{C}$ outside the down-sampled edit mask region are then masked out, giving the local context tokens $\mathbf{C}_\text{local}$, which go to the local encoder module $c_\phi$, whose outputs are added to selected transformer layers. The global embedder $G_\psi$ receives the query feature tokens and global context tokens produced from the down-sampled background content $\mathbf{V}_b^\downarrow$ and modulates the noisy cross-attended features. By disentangling local and global control, we can achieve a significant speedup by performing the expensive local control computations only on the masked edit region, while still maintaining video-wide appearance consistency with the lightweight global control.
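A rough sketch of how the two context streams could be assembled is shown below, with `vae_encode` standing in for $\mathcal{E}$ and temporal alignment simplified; shapes and helper names are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def build_contexts(v_src, v_mask, vae_encode, latent_stride=8, global_res=(32, 32)):
    """v_src: (T, 3, H, W) source frames; v_mask: (T, 1, H, W) binary edit masks."""
    v_bg = v_src * (1.0 - v_mask)                        # remove masked content -> background-only video
    z_bg = vae_encode(v_bg)                              # latent tokens, e.g. (T', C, H/8, W/8)
    m_dn = F.interpolate(v_mask, scale_factor=1.0 / latent_stride)
    m_dn = m_dn[: z_bg.shape[0]]                         # crude temporal alignment with the latent
    ctx = torch.cat([z_bg, m_dn], dim=1)                 # channel-wise concat -> control context C
    c_local = ctx * (m_dn > 0)                           # keep only tokens inside the edit region
    v_bg_small = F.interpolate(v_bg, size=global_res)    # fixed compact resolution for global context
    return c_local, v_bg_small
```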
We show results of EditCtrl on a variety of video editing tasks, including object addition and removal, attribute changes, and editing multiple regions with different prompts. Our method handles high-resolution videos with large edit masks while maintaining temporal consistency and visual quality. Below are results showcasing the effectiveness of our approach in local editing and inpainting while harmonizing the generated content with the global video context.
Use the arrow buttons to navigate between the video editing comparisons. The edit instructions shown in the figures are condensed from the translated target captions.
By defining the initial edit in one or more frames and propagating the mask, our approach can coherently generate the new content across subsequent frames in the local context in an autoregressive manner. Because video feeds run at a high frame rate, the global context changes little in the near future. We can therefore treat the global embedding as a causal embedding by padding the global context with its own last available frames, giving the model an adequate stand-in for global context about the future. This removes the need for future global context at inference time and allows us to edit the video into the future. The mask can also be propagated forward using motion cues such as optical flow or camera pose, enabling applications such as augmented-reality editing, where content needs to be generated before the headset acquires the frame and projected to the user once the frame is displayed.
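The padding itself is simple; the sketch below shows one plausible way to build a causal global context by repeating the last observed down-sampled frame. The function name and tensor layout are illustrative assumptions, not the exact EditCtrl code.

```python
import torch

def causal_global_context(global_frames, horizon):
    """global_frames: (T, C, h, w) down-sampled background frames observed so far.
    horizon: number of future frames the edit will be propagated into."""
    last = global_frames[-1:].expand(horizon, -1, -1, -1)  # repeat the most recent frame
    return torch.cat([global_frames, last], dim=0)         # padded context covering the future
```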
EditCtrl propagates edits into future frames in real time. First, the initial frames are edited given a text prompt and mask. To continue the edit forward, optical flow warps the last edited frames to approximate upcoming ones, providing generation context alongside the propagated masks. When actual frames arrive, the generated content is composited into them and used as context for subsequent generations. This allows coherent, continuous editing without requiring access to future frames.
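Putting the pieces together, the loop below is a hedged sketch of this streaming pipeline; `edit_chunk`, `compute_flow`, and `warp` are placeholders for the editor, an optical-flow estimator, and backward warping, and the tensor layouts are assumptions rather than the released implementation.

```python
import torch

def stream_edit(init_frames, init_masks, prompt, frame_stream,
                edit_chunk, compute_flow, warp):
    """init_frames: (T, 3, H, W); init_masks: (T, 1, H, W); frame_stream yields (3, H, W) frames."""
    edited = edit_chunk(init_frames, init_masks, prompt)   # edit the initial chunk normally
    masks = init_masks
    for real_frame in frame_stream:                        # frames arriving one at a time
        flow = compute_flow(edited[-2], edited[-1])        # motion between the last edited frames
        pred_frame = warp(edited[-1], flow)                # extrapolate the next frame's appearance
        pred_mask = warp(masks[-1], flow)                  # propagate the edit mask forward
        new_edit = edit_chunk(pred_frame[None], pred_mask[None], prompt)[0]
        # Composite the generated content into the real frame once it arrives,
        # then reuse the composite as context for the next step.
        composited = real_frame * (1 - pred_mask) + new_edit * pred_mask
        edited = torch.cat([edited, composited[None]], dim=0)
        masks = torch.cat([masks, pred_mask[None]], dim=0)
    return edited
```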
| Dataset | Method | #Par. | PFLOPS↓ | FPS↑ | Text CLIP↑ | Text CLIP (M)↑ | PSNR↑ | SSIM↑ | LPIPS×10²↓ | MSE×10²↓ | MAE×10²↓ | Temp. CLIP Sim↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VPBench-Edit | ReVideo | 1.5B | 193.39 | 0.11 | 9.34 | 20.01 | 15.52 | 0.49 | 27.68 | 3.49 | 11.14 | 0.42 |
| | VideoPainter | 5B | 817.81 | 0.12 | 8.67 | 20.20 | 22.63 | 0.91 | 7.65 | 1.02 | 2.90 | 0.18 |
| | VACE | 1.3B | 76.31 | 0.66 | 9.76 | 21.51 | 23.84 | 0.91 | 5.44 | 0.92 | 2.78 | 0.13 |
| | VACE | 14B | 589.19 | 0.10 | 9.85 | 21.54 | 24.02 | 0.92 | 5.13 | 0.84 | 2.68 | 0.13 |
| | EditCtrl | 1.5B | 17.42 | 4.67 | 9.58 | 21.70 | 24.16 | 0.92 | 5.54 | 0.99 | 3.01 | 0.15 |
| | EditCtrl | 16B | 124.53 | 1.19 | 9.46 | 21.73 | 24.37 | 0.93 | 5.10 | 0.80 | 2.65 | 0.13 |
EditCtrl outperforms the editing baselines and the full-attention base model on the VPBench-Edit test set in edited video quality, background preservation, and alignment with the text prompt, while achieving much higher denoising throughput and a lower total computational cost to process the test set.
| Dataset | Method | #Par. | FPS↑ | Text CLIP↑ | Text CLIP (M)↑ | PSNR↑ | SSIM↑ | LPIPS×10²↓ | MSE×10²↓ | MAE×10²↓ | Temp. CLIP Sim↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VPBench-Inp | ProPainter | 50M | 5.34 | 7.31 | 17.18 | 20.97 | 0.87 | 9.89 | 1.24 | 3.56 | 0.44 |
| | VideoPainter | 5B | 0.12 | 8.66 | 21.49 | 23.32 | 0.89 | 6.85 | 0.82 | 2.62 | 0.15 |
| | VACE | 1.3B | 0.66 | 9.77 | 21.55 | 22.62 | 0.85 | 9.30 | 1.25 | 4.43 | 0.17 |
| | VACE | 14B | 0.10 | 9.27 | 22.18 | 23.03 | 0.88 | 7.65 | 0.98 | 3.40 | 0.14 |
| | EditCtrl | 1.3B | 5.24 | 9.81 | 21.86 | 23.17 | 0.86 | 8.52 | 1.29 | 3.91 | 0.17 |
| | EditCtrl | 14B | 1.30 | 9.58 | 21.96 | 23.60 | 0.88 | 8.23 | 1.11 | 3.63 | 0.15 |
| DAVIS | ProPainter | 50M | 5.51 | 7.54 | 16.69 | 23.99 | 0.92 | 5.86 | 0.98 | 2.48 | 0.12 |
| | VideoPainter | 5B | 0.12 | 7.21 | 18.46 | 25.27 | 0.94 | 4.29 | 0.45 | 1.41 | 0.09 |
| | VACE | 1.3B | 0.66 | 7.27 | 17.83 | 25.75 | 0.88 | 5.11 | 0.34 | 2.03 | 0.10 |
| | VACE | 14B | 0.10 | 7.81 | 18.75 | 26.12 | 0.91 | 4.88 | 0.33 | 2.01 | 0.09 |
| | EditCtrl | 1.5B | 5.57 | 7.33 | 18.02 | 25.44 | 0.86 | 5.31 | 0.39 | 2.01 | 0.12 |
| | EditCtrl | 16B | 1.41 | 7.78 | 18.50 | 25.89 | 0.90 | 5.25 | 0.34 | 1.99 | 0.10 |
If you find EditCtrl useful for your research, please cite our paper:
@article{litman2026editctrl,
title={EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing},
author={Litman, Yehonathan and Liu, Shikun and Seyb, Dario and Milef, Nicholas and Zhou, Yang and Marshall, Carl and Tulsiani, Shubham and Leak, Caleb},
journal={arXiv},
year={2026}
}