Schrödinger Audio-Visual Editor

Object-Level Audiovisual Removal
Weihan Xu*¹, Kan Jen Cheng*²·³, Koichi Saito, Muhammad Jehanzeb Mirza¹, Tingle Li²·³, Yisi Liu²·³, Alexander Liu¹,
Liming Wang¹, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji⁴·⁵, Gopala Anumanchipalli²·³, Paul Pu Liang¹
¹MIT
²UC Berkeley
³Berkeley AI Research
⁴Sony AI
⁵Sony Group Corporation
* Equal contribution

TL;DR: This is an interactive audio-visual editor for removing objects from audio and video in parallel.

Contents

  1. Introduction
  2. Dataset Construction
  3. SAVE Removal Performance
  4. Generalizability
  5. Interactive Demo
  6. More Examples

Introduction

Schrödinger Audio-Visual Editor (SAVE) presents a novel approach to object-level audio-visual removal. Our method enables precise removal of both visual objects and their corresponding sounds simultaneously, addressing a critical challenge in video editing.

SAVE Teaser

Traditional video editing methods often struggle with joint audio-visual manipulation, either removing visual elements while leaving their sounds intact or vice versa. SAVE addresses this by removing a visual object and its corresponding sound together.

Dataset Construction

In this section, we present **SAVEBench**, the first audio-visual paired dataset for the object-level editing task.

I. Dataset Pipeline

We show our synthetic pair-generation pipelines. For audio, we use Qwen-VL to enumerate sounding objects, then synthesize an object-centric track for each with MMAudio, and keep only tracks validated as clean by Qwen-Audio. We form the \(N-1\) mixture by mixing all retained tracks except one, and the \(N\) mixture by reintroducing the held-out track. For video, we obtain object boxes with GroundingDINO and segmentation masks with SAM2. Using these masks, we remove the corresponding objects with Inpaint-Anything on each frame of the video.
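The audio pairing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the validated object tracks are already available as equal-length NumPy arrays at a shared sample rate, and it assumes mixing is a plain sample-wise sum with no gain normalization. The function name `make_removal_pairs` and the toy track values are hypothetical.

```python
import numpy as np

def make_removal_pairs(tracks):
    """Build (source, target) audio pairs, one per held-out object.

    tracks: dict mapping object name -> 1-D float array.
    Assumption: all arrays share the same length and sample rate,
    and mixing is a plain sum (the page does not specify the mixing rule).
    """
    pairs = {}
    for held_out, held_track in tracks.items():
        # The N-1 mixture: every retained track except the held-out one.
        reduced = np.zeros_like(held_track)
        for name, track in tracks.items():
            if name != held_out:
                reduced = reduced + track
        # The N mixture: reintroduce the held-out track.
        full = reduced + held_track
        # Source contains the object, target has it removed.
        pairs[held_out] = (full, reduced)
    return pairs

# Toy three-sample "tracks" standing in for synthesized object sounds.
tracks = {
    "guitar": np.array([0.1, 0.2, 0.3]),
    "speaker": np.array([0.0, 0.5, 0.0]),
    "man": np.array([0.2, 0.0, 0.1]),
}
pairs = make_removal_pairs(tracks)
source, target = pairs["guitar"]
```

In this sketch the difference between `source` and `target` is exactly the held-out track, which is the property that makes each pair usable as a removal example for that object.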

Dataset Pipeline

II. SAVEBench Examples

Example pairs from our SAVEBench dataset demonstrating various object removal scenarios.

Source | Target (Object Removed)
Sample 1: guitar | Sample 1: guitar removed
Sample 2: speaker | Sample 2: speaker removed
Sample 3: man | Sample 3: man removed

SAVE Removal Performance

In this section, we present editing results from the SAVE editor.

Source | Target (Object Removed)
Sample 1: violin | Sample 1: violin removed
Sample 2: machine | Sample 2: machine removed
Sample 3: cat | Sample 3: cat removed

Generalizability

Examples showing SAVE's ability to generalize to diverse real-world scenarios.

Single-Object Removal

Source | Output (SAVE)
Target Object: tree | tree removed
Target Object: orange | orange removed
Target Object: table | table removed

Multi-Object Removal

Source | Output (SAVE)
Target Objects: cat and dog | cat and dog removed

Interactive Demo

Coming Soon!

More Examples

Additional examples demonstrating the versatility and robustness of our audio-visual editor.

Source | SAVE (Ours)
Target Object: alarm clock | alarm clock removed
Target Object: plow | plow removed
Target Object: fire truck | fire truck removed
Target Object: man | man removed
Target Object: fireworks | fireworks removed
Target Object: motorcycle | motorcycle removed
Target Object: airplane | airplane removed
Target Object: bulldozer | bulldozer removed