Schrödinger Audio-Visual Editor (SAVE) presents a novel approach to object-level audio-visual removal. Our method enables precise removal of both visual objects and their corresponding sounds simultaneously, addressing a critical challenge in video editing.
Traditional video editing methods often struggle with joint audio-visual manipulation, either removing visual elements while leaving their sounds intact or vice versa. SAVE addresses this by:
Joint Audio-Visual Processing: Simultaneous removal of visual objects and their corresponding audio
Object-Level Control: Precise selection and removal of specific objects
High-Quality Results: Maintaining synchronization after removal
Generalization: Works across diverse real-world scenarios
Dataset Construction
In this section, we present **SAVEBench**, the first paired audio-visual dataset for the object-level editing task.
I. Dataset Pipeline
Our synthetic pair-generation pipeline has an audio branch and a visual branch. For audio, we use Qwen-VL to enumerate the sounding objects, synthesize an object-centric track for each with MMAudio, and keep only the tracks that Qwen-Audio validates as clean. We form the (N-1) mixture by mixing all retained tracks except one, and the (N) mixture by reintroducing the held-out track. For video, we obtain object boxes with GroundingDINO and segmentation masks with SAM2. Using these masks, we remove the corresponding objects with Inpaint-Anything on each frame of the video.
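The audio pairing step can be illustrated with a minimal sketch. It assumes each retained object track is an equal-length mono waveform stored as a list of floats and that mixing is a plain sample-wise sum; the actual mixing and normalization used for SAVEBench are not specified here, and the names `mix` and `make_audio_pair` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a track is a mono waveform, and all tracks share one length.
Track = List[float]


def mix(tracks: Sequence[Track]) -> Track:
    """Sum equal-length tracks sample by sample (no normalization)."""
    if not tracks:
        return []
    length = len(tracks[0])
    if any(len(t) != length for t in tracks):
        raise ValueError("all tracks must have the same length")
    return [sum(samples) for samples in zip(*tracks)]


def make_audio_pair(tracks: Sequence[Track], held_out: int) -> Tuple[Track, Track]:
    """Return (mix_n, mix_n_minus_1) for one held-out object track.

    mix_n_minus_1 mixes every track except the held-out one;
    mix_n adds the held-out track back on top of that mixture.
    """
    if not 0 <= held_out < len(tracks):
        raise IndexError("held_out is not a valid track index")
    rest = [t for i, t in enumerate(tracks) if i != held_out]
    # With a single track, the (N-1) mixture is silence of the same length.
    mix_n_minus_1 = mix(rest) if rest else [0.0] * len(tracks[held_out])
    mix_n = [a + b for a, b in zip(mix_n_minus_1, tracks[held_out])]
    return mix_n, mix_n_minus_1


# Hypothetical example: three validated object tracks, hold out the second.
tracks = [[1.0, 2.0], [0.5, 0.5], [0.25, 0.0]]
with_object, without_object = make_audio_pair(tracks, held_out=1)
```

Under this reading, `with_object` would serve as the source audio and `without_object` as the object-removed target, so the two differ only by the held-out track.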
II. SAVEBench Examples
Example pairs from our SAVEBench dataset demonstrating various object removal scenarios.
Source
Target (Object Removed)
Sample 1: guitar
Sample 1: guitar removed
Sample 2: speaker
Sample 2: speaker removed
Sample 3: man
Sample 3: man removed
SAVE Removal Performance
In this section, we present editing results produced by the SAVE editor.
Source
Target (Object Removed)
Sample 1: violin
Sample 1: violin removed
Sample 2: machine
Sample 2: machine removed
Sample 3: cat
Sample 3: cat removed
Generalizability
Examples showing SAVE's ability to generalize to diverse real-world scenarios.
Single-Object Removal
Source
Output (SAVE)
Target Object: tree
tree removed
Target Object: orange
orange removed
Target Object: table
table removed
Multi-Object Removal
Source
Output (SAVE)
Target Objects: cat and dog
cat and dog removed
Interactive Demo
Coming Soon!
More Examples
Additional examples demonstrating the versatility and robustness of our audio-visual editor.